ABSTRACT


Over  the  past  several  years  there  has  been  considerable  attention  focussed 
on  the  problem  of  enhancement  and  bandwidth  compression  of  speech  degraded  by 
additive  background  noise.  This  interest  is  motivated  by  several  factors 
including  a broad  set  of  important  applications,  the  apparent  lack  of 
robustness  in  current  speech  compression  systems  and  the  development  of 
several  potentially  promising  and  practical  solutions.  One  objective  of  this 
paper  is  to  provide  an  overview  of  the  variety  of  techniques  that  have  been 
proposed  for  enhancement  and  bandwidth  compression  of  speech  degraded  by 
additive  background  noise.  A second  objective  is  to  suggest  a unifying 
framework  in  terms  of  which  the  relationships  between  these  systems  are  more
visible  and  which  hopefully  will  provide  a structure  which  will  suggest
fruitful  directions  for  further  research.

ENHANCEMENT  AND  BANDWIDTH  COMPRESSION  OF  NOISY  SPEECH


I.  INTRODUCTION

There  are  a wide  variety  of  contexts  in  which  it  is  desired  to  enhance 
speech.  The  objective  of  enhancement  may  perhaps  be  to  improve  the  overall 
quality,  to  increase  intelligibility,  to  reduce  listener  fatigue,  etc. 
Depending  on  the  specific  application,  the  enhancement  system  may  be  directed 
at  only  one  of  these  objectives  or  several.  For  example  a speech 
communication  system  may  introduce  a low  amplitude  long  time  delay  echo  or  a 
narrowband  additive  disturbance  and  while  these  degradations  may  not  by 
themselves  reduce  intelligibility  for  the  purposes  for  which  the  channel  is 
used,  they  are  generally  objectionable  and  an  improvement  in  quality  perhaps 
even  at  the  expense  of  some  intelligibility  may  be  desirable.  Another  example 
is  the  communication  between  a pilot  and  an  air  traffic  control  tower.  In 
this  environment,  the  speech  is  typically  degraded  by  background  noise.  Of 
central  importance  is  the  intelligibility  of  the  speech  and  it  would  generally 
be  acceptable  to  sacrifice  quality  if  the  intelligibility  could  be  improved. 
Even  with  normal  undegraded  speech,  it  is  sometimes  useful  or  desirable  to 
provide  enhancement.  As  a simple  example  high  pass  filtering  of  normal  speech 
is  often  used  to  introduce  a "crispness"  which  is  generally  perceived  as  an 
improvement  in  quality. 

The  speech  enhancement  problem  covers  a broad  spectrum  of  constraints, 
applications  and  issues.  Environments  in  which  an  additive  background  signal 
has  been  introduced  are  common.  The  background  may  be  noise-like  such  as  in
aircraft,  street  noise  etc.  or  may  be  speech-like  such  as  an  environment  with
competing  speakers.  Other  examples  in  which  the  need  for  speech  enhancement 
arises  include  correcting  for  the  distortion  of  the  speech  of  underwater 
divers  breathing  a helium-oxygen  mixture,  and  correcting  the  distortion  of 
speech  due  to  pathological  difficulties  of  the  speaker  or  introduced  due  to  an 
attempt  to  speak  too  rapidly.  Even  for  these  examples,  the  problem  and 
techniques  vary,  depending  on  the  availability  of  other  signals  or 
information.  For  example,  for  enhancement  of  speech  in  an  aircraft  a separate 
microphone  can  be  used  to  monitor  the  background  noise  so  that  the 
characteristics  of  the  noise  can  be  used  to  adjust  or  adapt  the  enhancement 
system.  At  the  air  traffic  control  tower,  however,  the  only  signal  available
for  enhancement  is  the  degraded  speech. 

Another  very  important  application  for  speech  enhancement  is  in 
conjunction  with  speech  bandwidth  compression  systems.  Because  of  the 
increasing  role  of  digital  communication  channels,  the  need  for  encrypting  of 
speech  and  increased  emphasis  on  integrated  voice-data  networks,  speech 
bandwidth  compression  systems  are  destined  to  play  an  increasingly  important 
role  in  speech  communication  systems.  The  conceptual  basis  for  narrowband 
speech  compression  systems  stems  from  a model  for  the  speech  signal  based  on 
what  is  known  about  the  physics  and  physiology  of  speech  production.  Because 
of  this  reliance  on  a model  for  the  signal  it  is  not  unreasonable  to  expect 
that  as  the  signal  deviates  from  the  model,  due  to  distortion  such  as  additive 
noise,  the  performance  of  the  speech  compression  system  with  regard  to  factors 
such  as  quality,  intelligibility  etc.  will  degrade.  It  is  generally  agreed
that  the  performance  of  current  speech  compression  systems  degrades  rapidly  in
the  presence  of  additive  noise  and  other  distortions  and  there  is  currently 
considerable  interest  and  attention  being  directed  at  the  development  of  more 
robust  speech  compression  systems.  There  are  two  basic  approaches  which  are 
typically  considered  either  of  which  may  be  preferable  in  a given  situation. 
One  approach  is  to  base  the  bandwidth  compression  on  the  assumption  of 
undistorted  speech  and  develop  a pre-processor  to  enhance  the  degraded  speech 
in  preparation  for  further  processing  by  the  bandwidth  compression  system.  It 
is  important  to  recognize  that  in  enhancing  speech  in  preparation  for 
bandwidth  compression  the  effectiveness  of  the  pre-processor  is  judged  on  the 
basis  of  the  output  of  the  bandwidth  compression  system  in  comparison  with  the 
output  if  no  pre-processor  is  used.  Thus,  for  example,  it  is  possible  that 
the  output  of  the  preprocessor  would  be  judged  by  a listener  to  be  inferior 
(by  some  measure)  to  the  input  but  that  the  output  of  the  bandwidth 
compression  system  with  the  preprocessor  is  preferred  to  the  output  without 
it.  In  this  case,  the  preprocessor  would  clearly  be  considered  to  be 
effective  in  enhancing  the  speech  in  preparation  for  bandwidth  compression. 
Another  approach  to  bandwidth  compression  of  degraded  speech  is  to  incorporate 
into  the  model  for  the  signal  information  about  the  degradation.  A number  of 
systems  based  on  such  an  approach  have  recently  been  proposed  and  will  be 
discussed  in  detail  in  this  paper. 

As  is  evident  from  the  above  discussion,  the  general  problem  of  enhancing 
speech  is  broad  and  the  constraints,  information,  and  objectives  are  heavily 
dependent  on  the  specific  context  and  applications.  In  this  paper  we  consider
only  a small  subset  of  possible  topics,  specifically  the  enhancement  and
bandwidth  compression  of  speech  degraded  by  additive  noise.  Furthermore  we 
assume  that  the  only  signal  available  is  the  degraded  speech  and  that  the 
noise  does  not  depend  on  the  original  speech.  Many  practical  problems,  some 
of  which  have  already  been  discussed,  fall  into  this  framework  and  some 
problems  that  do  not  can  be  transformed  so  that  they  do.  For  example, 
multiplicative  noise  or  convolutional  noise  degradation  can  be  converted  to  an 
additive  noise  degradation  by  a homomorphic  transformation  (1,2).  In  another 
example,  signal  dependent  quantization  noise  in  PCM  signal  coding  can  be 
converted  to  a signal  independent  additive  noise  by  a pseudo-noise  technique 
(3,4,5). 
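The homomorphic idea mentioned above can be illustrated with a minimal sketch, assuming a strictly positive signal and a multiplicative disturbance (both hypothetical test data, not taken from the references). Taking logarithms turns the product into a sum, so the degradation becomes additive in the log domain.

```python
import math
import random

def to_log_domain(x):
    """Map strictly positive samples to the log domain, where products become sums."""
    return [math.log(v) for v in x]

random.seed(0)
# Hypothetical positive-valued signal and multiplicative disturbance.
s = [1.0 + 0.5 * math.sin(2 * math.pi * n / 16) for n in range(32)]
d = [math.exp(random.gauss(0.0, 0.1)) for _ in range(32)]
y = [si * di for si, di in zip(s, d)]  # multiplicative degradation y = s * d

log_y = to_log_domain(y)
log_s = to_log_domain(s)
log_d = to_log_domain(d)

# In the log domain the degradation is additive: log y = log s + log d.
max_err = max(abs(ly - (ls + ld)) for ly, ls, ld in zip(log_y, log_s, log_d))
```

Once the degradation is additive, techniques designed for additive noise can in principle be applied to the transformed signal, with the inverse (exponential) mapping applied afterwards.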

Even  within  the  limited  framework  outlined  above,  there  is  a diversity  of 
approaches  and  systems.  One  objective  of  this  paper  is  to  provide  an  overview 
of  the  variety  of  techniques  that  have  been  proposed  for  enhancement  of  speech 
degraded  by  additive  background  noise  both  for  direct  listening  and  as  a 
preprocessor  for  subsequent  bandwidth  compression.  Many  of  these  systems  were 
developed  independently  of  each  other  and  on  the  surface  often  appear  to  be 
unrelated.  Thus,  another  objective  of  the  paper  is  to  provide  a unifying 
framework  in  terms  of  which  the  relationship  between  these  systems  is  more 
visible,  and  which  hopefully  will  provide  a structure  which  will  suggest 
further  fruitful  directions  for  research. 

In  Section  II  of  this  paper  we  present  an  overview  of  the  general  topic.
In  this  overview  we  classify  the  various  enhancement  systems  based  on  the
information  assumed  about  the  speech  and  the  noise.  Some  systems  based  on
time-invariant  Wiener  filtering,  for  example,  rely  only  on  an  assumed  noise
power  spectrum  and  on  long-time  average  characteristics  of  speech,  such  as  the 
fact  that  the  average  speech  spectrum  decays  with  frequency  at  approximately 
6  dB/octave.  Other  systems  rely  on  aspects  of  speech  perception  or  speech
production  in  general  or  on  a detailed  model  of  speech. 
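For orientation, the standard non-causal Wiener filter for a stationary signal in uncorrelated additive noise has the frequency response shown below. The symbols P_s and P_d, denoting the power spectra of the speech and of the noise, are notation assumed here for illustration.

```latex
H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)}
```

The filter passes frequencies where the speech power dominates and attenuates those where the noise power dominates; in the time-invariant case, P_s is a long-time average speech spectrum.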

Sections  III,  IV  and  V present  a more  detailed  discussion  of  several  of 
these  categories  of  speech  enhancement  systems.  In  particular,  Section  III  is
concerned  with  the  general  principle  of  speech  enhancement  based  on  estimation 
of  the  short-time  spectral  amplitude  of  the  speech.  This  basic  principle 
encompasses  a variety  of  techniques  and  systems  including  the  specific  methods 
of  spectral  subtraction,  parametric  Wiener  filtering,  etc.  In  Section  IV 
speech  enhancement  techniques  which  rely  principally  on  the  concept  of  the 
short-time  periodicity  of  voiced  speech  are  reviewed,  including  comb  filtering 
and  related  systems.  Section  V discusses  a variety  of  systems  that  rely  on 
more  specific  modelling  of  the  speech  waveform.  As  we  will  discuss  in  detail, 
in  some  cases,  parameters  of  the  model  are  obtained  from  an  analysis  of  the 
degraded  speech  and  used  to  synthesize  the  enhanced  speech.  In  other  cases, 
the  results  of  an  analysis  based  on  a model  for  speech  are  used  to  control  an 
enhancement  filter,  perhaps  with  the  procedure  being  iterative  so  that  the 
output  of  an  enhancement  filter  is  then  subjected  to  further  analysis,  etc. 
Many  of  these  systems  also  incorporate  a number  of  the  techniques  introduced 
in  Section  III,  including  Wiener  filtering  and  spectral  subtraction. 

In  Sections  III,  IV  and  V the  focus  is  entirely  on  systems  for  enhancement 
with  the  evaluation  of  the  systems  being  based  on  listening  without  further

In  Section  VII  we  discuss  in  some  detail  the  evaluation  of  the  performance
of  the  various  systems  presented  in  the  earlier  Sections.  In  general  the 
performance  evaluation  of  a speech  enhancement  system  is  extremely  difficult, 
in  large  measure  because  the  appropriate  criteria  for  evaluation  are  heavily 
dependent  on  the  specific  application  of  the  system.  Relative  importance  of 
such  factors  as  quality,  intelligibility,  listener  fatigue,  etc.,  may  vary 
considerably  with  the  application.  In  Section  VII  we  summarize  the 
performance  evaluations  that  have  been  reported  for  the  various  systems 
presented  in  this  paper.  Since  the  evaluation  of  different  systems  has 
generally  been  based  on  different  procedures,  environments,  etc.,  no  attempt 
is  made  in  the  section  to  compare  individual  systems.  In  general,  however,  we 
will  see  that  while  many  of  the  enhancement  systems  reduce  the  apparent 
background  noise  and  thus  perhaps  increase  quality,  many  of  them,  to  varying
degrees,  reduce  intelligibility.  In  the  context  of  bandwidth  compression, 
however,  a number  of  systems  provide  an  increase  in  intelligibility  over  that 
obtained  without  the  incorporation  of  speech  enhancement.

II.  OVERVIEW  OF  SYSTEMS  FOR  ENHANCEMENT  AND  BANDWIDTH  COMPRESSION  OF  NOISY  SPEECH

As  indicated  in  the  previous  Section  our  focus  in  this  paper  is  on 
degradation  due  to  the  presence  of  additive  noise.  Even  within  this  limited 
context  there  are  a wide  variety  of  approaches  which  have  been  proposed  and 
explored.  Conceptually  any  approach  should  attempt  to  capitalize  on  available 
information  about  the  signal,  i.e.,  the  speech,  and  the  background  noise.

Speech  is  a special  subclass  of  audio  signals  and  there  are  reasonable  models 
in  terms  of  which  the  speech  waveform  can  be  described  and  categorized.  The 
more  specifically  we  attempt  to  model  the  speech  signal,  the  more  potential 
for  separating  it  from  the  background  noise.  On  the  other  hand,  the  more  we 
assume  about  the  speech  the  more  sensitive  the  enhancement  system  will  be  to 
inaccuracies  or  deviations  from  these  assumptions.  Thus,  incorporating 
assumptions  and  information  about  the  speech  signal  represents  tradeoffs  which 
are  reflected  in  the  various  systems.  In  a similar  manner  systems  can  attempt 
to  incorporate  detailed  information  about  the  background  noise.  For  example, 
the  type  of  processing  suggested  if  the  background  noise  is  a competing 
speaker  is  different  than  if  it  is  wideband  random  noise.  Thus  enhancement 
systems  also  tend  to  differ  in  terms  of  the  assumptions  made  regarding  the 
background  noise.  As  with  assumptions  related  to  the  signal,  the  more  an 
enhancement  system  attempts  to  capitalize  on  assumed  characteristics  of  the 
noise  the  more  susceptible  it  is  likely  to  be  to  deviations  from  these
assumptions.

Another  important  consideration  in  speech  enhancement  stems  from  the  fact
that  the  criteria  for  enhancement  ultimately  relate  to  an  evaluation  by  a
human  listener.  In  different  contexts  the  criteria  for  evaluation  may  differ 
depending  on  whether  quality,  intelligibility  or  some  other  attribute  is  the 
most  important.  Thus  speech  enhancement  must  inevitably  take  into  account 
aspects  of  human  perception.  As  we  will  indicate  shortly,  some  systems  are 
heavily  motivated  by  perceptual  considerations,  others  rely  more  on 
mathematical  criteria.  In  such  cases,  of  course,  the  mathematical  criteria 
must  in  some  way  be  consistent  with  human  perception,  and,  while  an  optimum 
mathematical  criterion  is  not  known,  some  mathematical  error  criteria  are 
understood  to  be  a better  match  than  others  to  aspects  of  human  perception. 

In  the  following  discussion  we  briefly  describe  some  aspects  of  speech
production  and  speech  perception  that  in  varying  degrees  play  a role  in  speech 
enhancement  systems.  Following  that  we  present  a brief  overview  of  a 
representative  collection  of  speech  enhancement  systems,  with  the  intent  of 
categorizing  these  systems  in  terms  of  the  various  aspects  of  speech 
production  and  perception  on  which  they  attempt  to  capitalize. 

Speech  is  generated  by  exciting  an  acoustic  cavity,  the  vocal  tract,  by 
pulses  of  air  released  through  the  vocal  cords  for  voiced  sounds,  or  by 
turbulence  for  unvoiced  sounds.  Thus  a simple  but  useful  model  for  speech 
production  consists  of  a linear  system,  representing  the  vocal  tract,  driven 
by  an  excitation  function  which  is  a periodic  pulse  train  for  voiced  sounds 
and  wideband  noise  for  unvoiced  sounds,  as  illustrated  in  Figure  1. 
Furthermore,  since  the  linear  system  represents  an  acoustic  cavity,  its 
response  is  of  a resonant  nature,  so  that  its  transfer  function  is
characterized  by  a set  of  resonant  frequencies,  referred  to  as  formants,  as
illustrated  in  Figure  2(a).  Thus,  if  the  excitation  and  vocal  tract 
parameters  are  fixed,  then  as  indicated  in  Figure  2(b),  the  speech  spectrum 
has  an  envelope  representing  the  vocal  tract  transfer  function  of  Figure  2(a) 
and  a fine  structure  representing  the  excitation. 
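The production model just described can be sketched in a few lines. This is a minimal illustration, assuming a cascade of two-pole resonators as the vocal tract filter; the sampling rate, formant frequencies, bandwidths and pitch period are hypothetical values chosen for the example, and the function names are not from the paper.

```python
import math
import random

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Denominator coefficients (a1, a2) of a two-pole resonator at freq_hz."""
    r = math.exp(-math.pi * bandwidth_hz / fs)      # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs            # pole angle from centre frequency
    return (-2.0 * r * math.cos(theta), r * r)

def all_pole_filter(x, sections):
    """Cascade of two-pole sections: y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    y = list(x)
    for a1, a2 in sections:
        y1 = y2 = 0.0
        out = []
        for v in y:
            cur = v - a1 * y1 - a2 * y2
            out.append(cur)
            y2, y1 = y1, cur
        y = out
    return y

def excitation(n_samples, is_voiced, pitch_period, rng):
    """Periodic pulse train for voiced sounds, wideband noise for unvoiced sounds."""
    if is_voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

fs = 8000                                            # assumed sampling rate in Hz
formants = [(500, 80), (1500, 100), (2500, 120)]     # illustrative (frequency, bandwidth) pairs
sections = [resonator_coeffs(f, b, fs) for f, b in formants]
rng = random.Random(0)

voiced = all_pole_filter(excitation(400, True, 80, rng), sections)
unvoiced = all_pole_filter(excitation(400, False, 80, rng), sections)
```

With fixed parameters the voiced output settles into a waveform that repeats every pitch period, which is the harmonic fine structure referred to above, while the resonators impose the formant envelope on both outputs.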

Many  of  the  techniques  for  speech  enhancement,  particularly  those  in 
Sections  III  and  V,  are  conceptually  based  on  the  representation  of  the  speech
signal  as  a stochastic  process.  This  characterization  of  speech  is  clearly 
more  appropriate  in  the  case  of  unvoiced  sounds  for  which  the  vocal  tract  is 
driven  by  wideband  noise.  The  vocal  tract  of  course  changes  shape  as 
different  sounds  are  generated  and  this  is  reflected  in  a time  varying 
transfer  function  for  the  linear  system  in  Figure  1.  However,  because  of  the 
mechanical  and  physiological  constraints  on  the  motion  of  the  vocal  tract  and 
articulators  such  as  the  tongue  and  lips,  it  is  reasonable  to  represent  the 
linear  system  in  Figure  1 as  a slowly  varying  linear  system  so  that  on  a 
short-time  basis  it  is  approximated  as  stationary.  Thus,  some  specific 
attributes  of  the  speech  signal,  which  can  be  capitalized  on  in  an  enhancement 
system  are  that  it  is  the  response  of  a slowly  varying  linear  system,  that  on 
a short-time  basis  its  spectral  envelope  is  characterized  by  a set  of 
resonances,  and  that  for  voiced  sounds,  on  a short-time  basis  it  has  a 
harmonic  structure.  This  simplified  model  for  speech  production  has  generally 
been  very  successful  in  a variety  of  engineering  contexts  including  speech 
enhancement,  synthesis  and  bandwidth  compression.  A more  detailed  discussion 
of  models  for  speech  production  can  be  found  in  (6,7,8).

The  perceptual  aspects  of  speech  are  considerably  more  complicated  and
less  well  understood.  However,  there  are  a number  of  commonly  accepted 
aspects  of  speech  perception  which  play  an  important  role  in  speech 
enhancement  systems.  For  example,  consonants  are  known  to  be  important  in  the 
intelligibility  of  speech  even  though  they  represent  a relatively  small 
fraction  of  the  signal  energy.  Furthermore,  it  is  generally  understood  that 
the  short-time  spectrum  is  of  central  importance  in  the  perception  of  speech 
and  that,  specifically,  the  formants  in  the  short-time  spectrum  are  more 
important  than  other  details  of  the  spectral  envelope.  It  appears  also,  that 
the  first  formant,  typically  in  the  range  of  250  Hz  to  800  Hz,  is  considerably 
less  important  perceptually,  than  the  second  formant  (9,10).  Thus  it  is 
possible  to  apply  a certain  degree  of  high  pass  filtering  (11,12)  to  speech 
which  may  perhaps  affect  the  first  formant  without  introducing  serious 
degradation  in  intelligibility.  Similarly,  low  pass  filtering  with  a cutoff
frequency  above  4  kHz,  while  perhaps  affecting  crispness  and  quality,  will  in
general  not  seriously  affect  intelligibility.  A good  representation  of  the 
magnitude  of  the  short-time  spectrum  is  also  generally  considered  to  be 
important  whereas  the  phase  is  relatively  unimportant.  Another  perceptual 
aspect  of  the  auditory  system  that  plays  a role  in  speech  enhancement  is  the 
ability  to  mask  one  signal  with  another.  Thus,  for  example,  narrowband  noise 
and  many  forms  of  artificial  noise  or  degradation  such  as  might  be  produced  by 
a vocoder  are  more  unpleasant  to  listen  to  than  broadband  noise  and  a speech 
enhancement  system  might  include  the  introduction  of  broadband  noise  to  mask 
the  narrowband  or  artificial  noise.

All  speech  enhancement  systems  rely  to  varying  degrees  on  the  aspects  of
speech  production  and  perception  outlined  above.  One  of  the  simplest
approaches  to  enhancement  is  the  use  of  lowpass  or  bandpass  filtering  to 
attenuate  the  noise  outside  the  band  of  perceptual  importance  for  speech. 

More  generally,  when  the  power  spectrum  of  the  noise  is  known,  one  can 
consider  the  use  of  Wiener  filtering,  based  on  the  long-time  power  spectrum  of 
speech.  While  in  some  cases  such  as  the  presence  of  narrowband  background 
noise,  this  is  reasonably  successful,  Wiener  filtering  based  on  the  long-time 
power  spectrum  of  the  speech  and  noise  is  limited  because  speech  is  not 
stationary.  Even  if  speech  were  truly  stationary,  mean  square  error,  which  is 
the  error  criterion  on  which  Wiener  filtering  is  based  is  not  strongly 
correlated  with  perception  and  thus  is  not  a particularly  effective  error 
criterion  to  apply  to  speech  processing  systems.  This  is  evidenced,  for 
example,  in  the  use  of  masking  for  enhancement.  By  adding  broadband  noise  to 
mask  other  degradation,  we  are,  in  effect,  increasing  the  mean  square  error. 
Another  example  that  suggests  that  mean  square  error  is  not  well  matched  to 
the  perceptually  important  attributes  in  speech  is  the  fact  that  distortion  of 
the  speech  waveform  by  processing  with  an  all-pass  filter  results  in 
essentially  no  audible  difference  if  the  impulse  response  of  the  all  pass 
filter  is  reasonably  short  but  can  result  in  a substantial  mean  square  error 
between  the  original  and  filtered  speech.  In  other  words,  mean  square  error 
is  sensitive  to  phase  of  the  spectrum  whereas  perception  tends  not  to  be. 
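The all-pass observation can be checked numerically. The sketch below is an illustration under simplifying assumptions: a first-order all-pass response with a hypothetical coefficient of 0.5 is applied circularly through a naive DFT (written out with the standard library so the example is self-contained), and a white-noise frame stands in for speech.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Inverse DFT, returning the real part (input assumed conjugate-symmetric)."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def allpass_response(omega, coeff=0.5):
    """First-order all-pass H = (z^-1 - c) / (1 - c z^-1); |H| = 1 at every frequency."""
    z_inv = cmath.exp(-1j * omega)
    return (z_inv - coeff) / (1.0 - coeff * z_inv)

rng = random.Random(0)
n = 64
x = [rng.gauss(0.0, 1.0) for _ in range(n)]          # stand-in for a speech frame
X = dft(x)
Y = [X[k] * allpass_response(2 * math.pi * k / n) for k in range(n)]
y = idft_real(Y)

# The spectral magnitude is untouched, yet the waveform error is large.
mag_diff = max(abs(abs(xk) - abs(yk)) for xk, yk in zip(X, Y))
mse = sum((u - v) ** 2 for u, v in zip(x, y)) / n
signal_power = sum(u * u for u in x) / n
```

Here the magnitudes agree to within rounding error, while the mean square error between input and output is comparable to the signal power itself, consistent with the point that mean square error penalizes phase changes that the listener does not hear.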

Masking  and  bandpass  filtering  represent  two  simple  ways  in  which 
perceptual  aspects  of  the  auditory  system  can  be  exploited  in  speech
enhancement.  Another  system  whose  motivation  depends  heavily  on  aspects  of
speech  perception  was  proposed  by  Thomas  and  Niederjohn  (12)  as  a
preprocessor  prior  to  the  introduction  of  noise  in  those  applications  where
noise-free  speech  is  available  for  processing.  In  essence,  their  system 
applies  high  pass  filtering  to  reduce  or  remove  the  first  formant  followed  by 
infinite  clipping.  The  motivation  for  the  system  lies  in  the  observation  that 
at  a given  signal  to  noise  ratio  infinite  clipping  will  increase,  relative  to
the  vowels,  the  amplitude  of  the  perceptually  important  low  amplitude  events
such  as  consonants,  thus  making  them  less  susceptible  to  masking  by  noise.  In
addition,  for  vowels  the  filtering  will  increase  the  amplitude  of  higher 
formants  relative  to  the  first  formant,  thus  making  the  perceptually  more 
important  higher  formants  less  susceptible  to  degradation.  In  the  speech 
enhancement  problem  considered  in  this  paper,  noise-free  speech  is  not 
available  for  processing  as  required  in  the  above  system.  Thomas  and 
Ravindran  (13),  however,  applied  high-pass  filtering  followed  by  infinite 
clipping  to  noisy  speech  as  an  experiment.  While  quality  may  be  degraded  by 
the  process  of  filtering  and  clipping,  they  claim  a noticeable  improvement  in 
intelligibility  when  applied  to  enhance  speech  degraded  by  wide-band  random 
noise.  One  possible  explanation  may  be  that  the  high-pass  filtering  operation 
reduces  the  masking  of  perceptually  important  higher  formants  by  the 
relatively  unimportant  low  frequency  components. 
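The effect of high-pass filtering followed by infinite clipping can be illustrated with a toy signal. This is only a sketch: a first difference is assumed as a crude stand-in for the high-pass filter (the filter used in (12,13) is not specified here), and the loud low-frequency and quiet high-frequency segments are hypothetical substitutes for a vowel and a consonant.

```python
import math

def first_difference(x):
    """Crude high-pass filter y[n] = x[n] - x[n-1] (an assumed stand-in)."""
    prev = 0.0
    out = []
    for v in x:
        out.append(v - prev)
        prev = v
    return out

def infinite_clip(x):
    """Keep only the sign of each sample (zero maps to zero)."""
    return [1.0 if v > 0 else -1.0 if v < 0 else 0.0 for v in x]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

fs = 8000  # assumed sampling rate in Hz
# Loud low-frequency "vowel-like" segment followed by a quiet
# high-frequency "consonant-like" segment.
loud = [1.0 * math.sin(2 * math.pi * 300 * n / fs) for n in range(400)]
quiet = [0.05 * math.sin(2 * math.pi * 3000 * n / fs) for n in range(400)]
x = loud + quiet

y = infinite_clip(first_difference(x))

ratio_before = rms(quiet) / rms(loud)        # quiet segment is about 26 dB down
ratio_after = rms(y[400:]) / rms(y[:400])    # both segments at about the same level
```

After clipping, the quiet segment has essentially the same level as the loud one, which is the mechanism by which low amplitude events are made less susceptible to masking by noise added afterwards.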

Another  system  which  relies  heavily  on  human  perception  of  speech  was 
proposed  by  Drucker  (14).  Based  on  some  perceptual  tests,  Drucker  concluded 
that  one  primary  cause  for  the  intelligibility  loss  in  speech  degraded  by
wide-band  random  noise  is  the  confusion  among  the  fricative  and  plosive  sounds
which  is  partly  due  to  the  loss  of  short  pauses  immediately  before  the  plosive 
sounds.  By  high-pass  filtering  one  of  the  fricative  sounds,  the  /s/  sound, 
and  inserting  short  pauses  before  the  plosive  sounds  (assuming  that  their 
locations  can  be  accurately  determined),  Drucker  claims  a significant
improvement  in  intelligibility. 

In  discussing  perceptual  attributes  we  indicated  that  the  short-time 
spectral  magnitude  is  generally  considered  to  be  important  whereas  the  phase 
is  relatively  unimportant.  This  forms  the  basis  for  a class  of  speech
enhancement  systems  which  attempt  in  various  ways  to  estimate  the  short-time 
spectral  magnitude  of  the  speech  without  particular  regard  to  the  phase  and  to 
use  this  to  recover  or  reconstruct  the  speech.  This  class  of  systems  includes 
spectral  subtraction  techniques,  originally  due  to  Weiss  et  al.  (15,16)  and
the  subject  of  a great  deal  of  recent  attention  (17,18,19,20,21,22),
and  optimum  filtering  techniques  such  as  Wiener  filtering  and  power  spectrum 
filtering.  These  systems  will  be  discussed  in  considerable  detail  in  Section 
III.  As  we  will  see,  many  of  these  systems  which  appear  on  the  surface  to  be 
different  are  in  fact  identical  or  very  closely  related. 

In  addition  to  directly  or  indirectly  utilizing  perceptual  attributes  most 
enhancement  systems  rely  to  varying  degrees  on  aspects  of  speech  production. 
For  example,  in  Section  IV,  we  describe  in  detail  a variety  of  systems  that 
attempt,  in  some  way,  to  capitalize  on  short-time  periodicity  of  speech  during 
voiced  sounds.  As  a consequence  of  this  periodicity,  during  voiced  intervals 
the  speech  spectrum  has  a harmonic  structure  which  suggests  the  possibility  of
applying  comb  filtering  or,  as  proposed  by  Parsons  (23),  attempting  to  extract
in  other  ways  the  components  of  the  speech  spectrum  only  at  the  harmonic
frequencies.  In  essence,  knowledge  of  the  harmonic  structure  of  voiced  sounds 
allows  us  in  principle  to  remove  the  noise  in  the  spectral  bands  between  the 
harmonics. 

As  discussed  in  Section  IV  speech  enhancement  by  comb  filtering  can  also 
be  viewed  in  terms  of  averaging  successive  periods  of  the  noisy  speech  to 
partially  cancel  the  noise.  Another  system,  which  attempts  to  take  advantage 
of  the  quasi-periodic  nature  of  the  speech  was  proposed  by  Sambur  (24).  As
developed  in  more  detail  in  Section  IV,  his  system  is  based  on  the  principles
of  adaptive  noise  cancelling.  Unlike  the  classical  procedure,  Sambur's  method
is  designed  to  cancel  out  the  clean  speech  signal,  taking  advantage  of  the 
quasi-periodic  nature  of  the  speech  to  form  an  estimate  of  the  speech  at  each 
time  instant  from  the  value  of  the  signal  one  period  earlier. 
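The view of comb filtering as an average over successive pitch periods can be sketched as follows. The example assumes an exactly periodic signal with a known, constant period, which real voiced speech only approximates; the period, noise level and number of averaged periods are hypothetical values, and the function names are for illustration only.

```python
import math
import random

def average_periods(x, period, n_periods):
    """FIR comb: average each sample with those 1..n_periods-1 pitch periods earlier."""
    out = []
    for n in range(len(x)):
        taps = [x[n - k * period] for k in range(n_periods) if n - k * period >= 0]
        out.append(sum(taps) / len(taps))
    return out

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

rng = random.Random(0)
period = 40          # assumed pitch period in samples
n_samples = 2000
clean = [math.sin(2 * math.pi * n / period) + 0.5 * math.sin(4 * math.pi * n / period)
         for n in range(n_samples)]
noisy = [c + rng.gauss(0.0, 0.5) for c in clean]

enhanced = average_periods(noisy, period, n_periods=4)

mse_noisy = mse(noisy, clean)
mse_enhanced = mse(enhanced, clean)
```

Because the periodic component adds coherently while independent noise samples do not, averaging four periods reduces the noise power by roughly a factor of four in this idealized setting. With real speech, pitch variation limits how many periods can usefully be averaged.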

In  the  model  of  speech  production,  we  represented  the  speech  signal  as 
generated  by  exciting  a quasi-stationary  linear  system  with  a pulse  train  for
voiced  speech  and  noise  for  unvoiced  speech.  Based  on  this  model,  an  approach 
to  speech  enhancement  is  to  attempt  to  estimate  parameters  of  the  model  rather 
than  the  speech  itself  and  to  then  use  this  to  synthesize  the  speech,  i.e.,  to 
enhance  speech  through  the  use  of  an  analysis-synthesis  system.  A
particularly  novel  application  of  this  concept  was  used  by  Stockham  and  Miller 
(25)  to  remove  the  orchestral  accompaniment  from  early  recordings  of  Enrico 
Caruso.  In  this  system  homomorphic  deconvolution  was  used  to  estimate  the 
impulse  response  of  the  model  in  Figure  1.  A similar  approach  to  noise
reduction  was  proposed  by  Suzuki  (26,27)  whereby  the  short-time  correlation
function  of  the  degraded  speech  is  used  as  an  estimate  of  the  impulse  response 
of  the  linear  system.  This  system  is  referred  to  as  SPAC  (Splicing  of 
Autocorrelation  Function).  A modification  of  SPAC  is  referred  to  as  SPOC 
(Splicing  of  Crosscorrelation  Coefficients).  A number  of  systems  also  attempt
to  model  in  more  detail  the  vocal-tract  impulse  response.  As  we  discussed
previously  the  vocal-tract  transfer  function  is  characterized  by  a set  of 
resonances  or  formants  that  are  perceptually  important.  This  suggests  the 
possibility  of  representing  the  vocal-tract  impulse  response  in  terms  of  a 
pole-zero  model  with  the  analysis  procedure  directed  at  estimating  the
associated  parameters.  The  poles  in  particular  would  provide  a reasonable 
representation  of  the  formants. 
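As a point of reference, a pole-zero model of the vocal-tract transfer function can be written in the generic form below. The gain G, the orders p and q, and the coefficient names a_k and b_k are notation assumed here for illustration; sign conventions for the denominator vary between authors.

```latex
H(z) = G \, \frac{1 + \sum_{k=1}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}}
```

Each complex-conjugate pair of roots of the denominator contributes one resonance, so the poles carry the formant information. Setting q = 0 gives the all-pole special case.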

All-pole  modelling  of  speech  has  had  notable  success  in  analysis/synthesis 
systems  for  clean  speech.  A number  of  recent  efforts  have  been  directed 
toward  estimating  the  parameters  in  an  all-pole  model  from  noisy  observations 
of  the  speech  such  as  the  systems  by  Magill  and  Un  (28),  Lim  and  Oppenheim 
(29),  and  Lim  (28).  Extensions  to  pole-zero  modelling  have  also  been  proposed 
by  Musicus  and  Lim  (30)  and  Musicus  (31).  These  various  approaches  are
described  and  compared  in  detail  in  Section  V. 

The  above  discussion  was  intended  as  a brief  overview  of  the  general 
approaches  to  speech  enhancement.  In  the  next  three  sections  we  explore  in 
more  detail  many  of  the  systems  mentioned  above.  In  particular,  in  Section 
III,  we  focus  on  speech  enhancement  techniques  based  on  short-time  spectral 
amplitude  estimation.  In  Section  IV  our  focus  is  on  speech  enhancement  based
on  periodicity  of  voiced  speech  and  in  Section  V on  speech  enhancement
techniques  using  an  analysis-synthesis  procedure.

III.  SPEECH  ENHANCEMENT  TECHNIQUES  BASED  ON  SHORT-TIME  SPECTRAL  AMPLITUDE  ESTIMATION

In  general,  in  enhancement  of  a signal  degraded  by  additive 
noise,  it  is  significantly  easier  to  estimate  the  spectral  amplitude 
associated  with  the  original  signal  than  it  is  to  estimate  both  amplitude 
and  phase.  As  we  discussed  in  Section  II,  it  is  principally  the  short- 
time  spectral  amplitude  rather  than  phase  that  is  important  for  speech 
intelligibility  and  quality.  As  we  discuss  in  this  section,  there  are  a 
variety  of  speech  enhancement  techniques  that  capitalize  on  this  aspect 
of  speech  perception  by  focusing  on  enhancing  only  the  short-time  spectral 
amplitude.  The  techniques  to  be  discussed  can  be  broadly  classified 
into two groups. In the first, presented in Section III.1, the short-time
spectral amplitude is estimated in the frequency domain, using the
spectrum of the degraded speech. Each short-time segment of the enhanced
speech waveform in the time domain is then obtained by inverse transforming
this spectral amplitude estimate combined with the phase of the degraded
speech. In the second class, discussed in Section III.2, the degraded
speech  is  first  used  to  obtain  a filter  which  is  then  applied  to  the 
degraded  speech.  Since  these  procedures  lead  to  zero-phase  filters,  it 
is  again  only  the  spectral  amplitude  that  is  enhanced,  with  the  phase  of 
the  filtered  speech  being  identical  to  that  of  the  degraded  speech. 

In both classes of systems discussed below, no conceptual distinction
is made between voiced and unvoiced speech and in particular, in contrast
to the techniques to be discussed in Section IV, the periodicity of
voiced speech is not exploited. Both classes of systems in this section
are most easily interpreted in terms of a stochastic characterization of
the speech signal. While this characterization is more justifiable for
unvoiced speech, it has been shown empirically to also lead to successful
procedures for voiced speech.

III.1 SPEECH ENHANCEMENT BASED ON DIRECT ESTIMATION OF SHORT-TIME SPECTRAL AMPLITUDE

When  a stationary  random  signal  s(n)  has  been  degraded  by 
uncorrelated  additive  noise  d(n)  with  a known  power  density  spectrum, 
the  power  density  spectrum  or  spectral  amplitude  of  the  signal  is  easily 
estimated  through  a process  of  spectral  subtraction.  Specifically,  if 
$$y(n) = s(n) + d(n) \tag{1}$$

and $P_y(\omega)$, $P_s(\omega)$ and $P_d(\omega)$ represent the power density
spectra of y(n), s(n) and d(n) respectively, then

$$P_y(\omega) = P_s(\omega) + P_d(\omega) \tag{2}$$

Consequently, a reasonable estimate for $P_s(\omega)$ is obtained by subtracting
the known spectrum $P_d(\omega)$ from an estimate of $P_y(\omega)$ developed from
the observations of y(n).

Speech, of course, is not a stationary signal. However, with s(n)
in eq. (1) now representing a speech signal and with the processing to
be carried out on a short-time basis, we consider s(n), d(n) and y(n)
multiplied by a time-limited window w(n). With $y_w(n)$, $d_w(n)$ and $s_w(n)$
denoting the windowed signals y(n), d(n) and s(n) and $Y_w(\omega)$,
$D_w(\omega)$ and $S_w(\omega)$ as their respective Fourier transforms we have

$$y_w(n) = s_w(n) + d_w(n) \tag{3}$$

and

$$|Y_w(\omega)|^2 = |S_w(\omega)|^2 + |D_w(\omega)|^2 + S_w(\omega) D_w^*(\omega) + S_w^*(\omega) D_w(\omega) \tag{4}$$

where $D_w^*(\omega)$ and $S_w^*(\omega)$ represent complex conjugates of
$D_w(\omega)$ and $S_w(\omega)$.
The function $|S_w(\omega)|^2$ will be referred to as the short-time energy
spectrum of speech. For speech enhancement based on the short-time
spectral amplitude, the objective is to obtain an estimate
$|\hat{S}_w(\omega)|$ of $|S_w(\omega)|$ and from this, an estimate
$\hat{s}_w(n)$ of $s_w(n)$.

From the estimate $\hat{s}_w(n)$, speech can be generated in a variety of
different ways. One approach is to use an analysis window function w(n)
that generates s(n) when all the frames of $s_w(n)$ are overlapped and
added with the appropriate time registration. Such a window function
satisfies the equation

$$\sum_i w_i(n) = 1 \quad \text{for all } n \text{ of interest} \tag{5}$$

where $w_i(n)$ represents the ith window frame. Two such examples are
overlapped triangular and Hamming windows. Using such a window function,
speech is then generated by adding up the estimates of the windowed
segments.
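
As an illustration of eq. (5), the sketch below sums half-overlapped windows and checks that the sum is constant. The window length, the hop of half a frame, and the function names are assumptions chosen for this example. The triangular window sums to one directly; the half-overlapped Hamming window sums to the constant 1.08, so it satisfies eq. (5) only after division by that constant.

```python
import numpy as np

def periodic_triangular(N):
    """Triangular window of even length N: zero at n = 0, one at n = N/2."""
    n = np.arange(N)
    return 1.0 - np.abs(n - N / 2) / (N / 2)

def periodic_hamming(N):
    """Hamming window with period N."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / N)

def overlap_sum(window, hop, num_frames):
    """Sum of num_frames copies of the window, each shifted by hop samples."""
    N = len(window)
    total = np.zeros(hop * (num_frames - 1) + N)
    for i in range(num_frames):
        total[i * hop:i * hop + N] += window
    return total

# Assumed illustrative values: 256-sample frames with 50% overlap.
N, hop = 256, 128
tri_sum = overlap_sum(periodic_triangular(N), hop, 8)
ham_sum = overlap_sum(periodic_hamming(N), hop, 8)

# Skip the first and last half-frame, where only one window is active.
interior = slice(hop, -hop)
print(np.allclose(tri_sum[interior], 1.0))    # triangular windows sum to 1
print(np.allclose(ham_sum[interior], 1.08))   # Hamming windows sum to 1.08
```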

Various speech enhancement techniques discussed in this section
differ primarily in how $|\hat{S}_w(\omega)|$ is specifically estimated from
the noisy speech. In one spectral subtraction technique, referred to¹ as
power spectrum subtraction, $|\hat{S}_w(\omega)|$ is estimated based on
eq. (4). From the observed data $y_w(n)$, $|Y_w(\omega)|^2$ can be obtained
directly. The terms $|D_w(\omega)|^2$, $S_w(\omega) D_w^*(\omega)$ and
$S_w^*(\omega) D_w(\omega)$ cannot be obtained exactly and in
the power spectrum subtraction technique they are approximated by
$E[|D_w(\omega)|^2]$, $E[S_w(\omega) D_w^*(\omega)]$ and
$E[S_w^*(\omega) D_w(\omega)]$ where E[.] denotes the
ensemble average. For d(n) zero mean² and uncorrelated with s(n),
$E[S_w(\omega) D_w^*(\omega)]$ and $E[S_w^*(\omega) D_w(\omega)]$ are zero
and an estimate $|\hat{S}_w(\omega)|^2$ of $|S_w(\omega)|^2$ is suggested
from eq. (4) as

$$|\hat{S}_w(\omega)|^2 = |Y_w(\omega)|^2 - E[|D_w(\omega)|^2] \tag{6}$$

where $E[|D_w(\omega)|^2]$ is obtained either from the assumed known properties
of d(n) or by an actual measurement from the background noise in the
intervals where speech is not present. From eq. (6), $|\hat{S}_w(\omega)|^2$
is not guaranteed to be non-negative since the right hand side can become
negative, and a number of somewhat arbitrary choices have been made. In
some studies, the negative values are made positive by changing the
sign. In some other studies $|\hat{S}_w(\omega)|^2$ is set to zero if
$|Y_w(\omega)|^2$ is less than $E[|D_w(\omega)|^2]$. The latter approach has
been more extensively used in the literature, and as will be seen later it
can be related directly to the optimum filtering technique discussed in
Section III.2.

¹ The name "power spectrum subtraction" comes from the close
similarity between eq. (2) and eq. (6).

² The zero mean assumption for the additive random noise
is made only for notational convenience.
Given an estimate of $|S_w(\omega)|$, there are a variety of different ways
to estimate $s_w(n)$. One method which has been used extensively in the
class of systems discussed in this section and is also consistent with
the notion that short-time phase is relatively unimportant is to approximate
$\angle S_w(\omega)$, the phase of $S_w(\omega)$, by $\angle Y_w(\omega)$. Then

$$\hat{S}_w(\omega) = |\hat{S}_w(\omega)| \, e^{j \angle Y_w(\omega)} \tag{7}$$

and

$$\hat{s}_w(n) = F^{-1}[\hat{S}_w(\omega)] \tag{8}$$

A typical  algorithm  for  speech  enhancement  by  the  power  spectrum  subtraction 
technique  is  shown  in  Figure  3. 
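
The per-frame processing implied by eqs. (6) through (8) can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Figure 3: the function name is hypothetical, the transform is a DFT of the windowed frame, negative differences are set to zero, and `noise_power` is assumed to be a scalar or an array that is symmetric across the DFT bins so that the output is real.

```python
import numpy as np

def power_spectral_subtraction_frame(y_w, noise_power):
    """Enhance one windowed frame y_w, given noise_power ~ E[|D_w|^2] per DFT bin."""
    Y = np.fft.fft(y_w)
    # Eq. (6): subtract the expected noise power from the noisy power spectrum.
    S_power = np.abs(Y) ** 2 - noise_power
    # Assumed choice: negative values are set to zero before the square root.
    S_mag = np.sqrt(np.maximum(S_power, 0.0))
    # Eq. (7): combine the magnitude estimate with the phase of the noisy frame.
    S_hat = S_mag * np.exp(1j * np.angle(Y))
    # Eq. (8): inverse transform; the imaginary part is rounding error only.
    return np.real(np.fft.ifft(S_hat))

# Illustrative use on a synthetic frame (all values are assumptions).
rng = np.random.default_rng(0)
N = 256
clean = np.sin(2 * np.pi * 10 * np.arange(N) / N) * np.hamming(N)
noise = 0.3 * rng.standard_normal(N) * np.hamming(N)
# For white noise of variance 0.09, E[|D_w|^2] is 0.09 times the window energy.
noise_power = 0.09 * np.sum(np.hamming(N) ** 2)
enhanced = power_spectral_subtraction_frame(clean + noise, noise_power)
```

Successive enhanced frames would then be overlapped and added as described around eq. (5).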

Except  for  some  details  and  interpretations,  the  power  spectrum 
subtraction  technique  discussed  above  is  a special  case  of  a more  general 
system  originated  by  Weiss,  et  al.  (15,16). 

Specifically, the power spectrum subtraction technique can also be
interpreted in terms of estimating the short time correlation $\phi_s(n)$
as

$$\hat{\phi}_s(n) = \phi_y(n) - E[\phi_d(n)] \tag{9}$$

where

$$\phi_s(n) = \sum_{k=-\infty}^{\infty} s_w(k) \, s_w(k-n) = F^{-1}[|S_w(\omega)|^2] \tag{10}$$

and $\phi_y(n)$ and $\phi_d(n)$ are similarly defined. For this reason, the
power spectrum subtraction technique is also referred to as the
correlation subtraction technique. Weiss, et al focussed on estimating
the short time correlation function and in place of a squaring operation
used an arbitrary positive real constant "a". In their technique, then,
by defining $\phi_s(n)$ to be $F^{-1}[|S_w(\omega)|^a]$, $\phi_s(n)$ is
estimated as

$$\hat{\phi}_s(n) = \phi_y(n) - E[\phi_d(n)] = F^{-1}[|Y_w(\omega)|^a] - E[F^{-1}[|D_w(\omega)|^a]] \tag{11}$$

Based on this estimate and the assumption that $\angle S_w(\omega)$ equals
$\angle Y_w(\omega)$, the windowed speech $s_w(n)$ is estimated. The speech
enhancement system proposed by Weiss et al is shown in Figure 4.

The  system  in  Figure  4 can  be  simplified  both  computationally  and 
conceptually  (18,19)  by  recognizing  that  the  expectation  and  Fourier 
transform  operations  in  eq.  (11)  are  inter-changeable  and  therefore  eq. 
(11)  is  equivalent  to 

$$|\hat{S}_w(\omega)|^a = |Y_w(\omega)|^a - E[|D_w(\omega)|^a] \tag{12}$$

Such  a simplified  system  based  on  eq.  (12)  is  shown  in  Figure  5.  As  is 
evident  in  Figure  5 the  system  proposed  by  Weiss,  et  al  is  a technique 
to  estimate  the  short-time  spectral  amplitude  of  speech  by  a particular 
form  of  spectral  subtraction.  The  performance  of  the  system  in  Figure  5 
as  a speech  enhancement  system  was  evaluated  by  Lim  (19)  and  the  results 
will  be  discussed  in  Section  VI.  When  the  constant  "a"  is  set  to  unity, 
the  system  in  Figure  5 reduces  to  the  speech  enhancement  system  developed 
by  Boll  (20). 

The parameter "a" in eq. (12) obviously affords a degree of flexibility
over the system based on eq. (6). A further generalization is to
introduce an additional degree of flexibility by estimating $|S_w(\omega)|$
through the relation

$$|\hat{S}_w(\omega)|^a = |Y_w(\omega)|^a - k \, E[|D_w(\omega)|^a] \tag{13}$$

where now there are the two parameters a and k. This generalization
with a and k as parameters was considered for speech enhancement by Lim
(18) and Berouti et al (21). Just as with the specific form of spectral
subtraction in eq. (6), each short-time speech segment is in effect
estimated by utilizing the phase associated with the noisy speech and
negative values on the right-hand side of eq. (13) can be dealt with
through the use of full-wave or half-wave rectification. The additional
possibility of also utilizing a frequency dependent threshold on the
right-hand side of eq. (13) was considered by Berouti et al (21).
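
A short sketch of eq. (13) with the two rectification options follows. The function name, argument names and default values are illustrative assumptions; the frequency dependent threshold of Berouti et al is not modelled here.

```python
import numpy as np

def generalized_subtraction(Y_mag, noise_a, a=2.0, k=1.0, rectify="half"):
    """Estimate |S_w| from |Y_w| through eq. (13).

    Y_mag   : |Y_w(w)| for each frequency bin
    noise_a : estimate of E[|D_w(w)|^a] for each bin
    rectify : "half" sets negative differences to zero,
              "full" changes their sign
    """
    diff = np.asarray(Y_mag, dtype=float) ** a - k * noise_a
    if rectify == "half":
        diff = np.maximum(diff, 0.0)
    elif rectify == "full":
        diff = np.abs(diff)
    else:
        raise ValueError("rectify must be 'half' or 'full'")
    return diff ** (1.0 / a)

# a = 2, k = 1 corresponds to eq. (6); a = 1, k = 1 to the a = 1 case of eq. (12).
Y_mag = np.array([3.0, 1.0])
print(generalized_subtraction(Y_mag, 4.0, a=2.0, k=1.0, rectify="half"))
print(generalized_subtraction(Y_mag, 4.0, a=2.0, k=1.0, rectify="full"))
```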

Another approach, which leads to a further modification of spectral
subtraction, was proposed by McAulay and Malpass (32). In this approach,
the problem was formulated by assuming that at each frequency the noise
is Gaussian and developing the maximum likelihood estimate of
$|S_w(\omega)|$. The resulting estimate has the form

$$|\hat{S}_w(\omega)| = \tfrac{1}{2} |Y_w(\omega)| + \tfrac{1}{2} \left[ |Y_w(\omega)|^2 - E[|D_w(\omega)|^2] \right]^{1/2} \tag{14}$$

A further variation, proposed by McAulay and Malpass (32), modifies eq.
(14) by a factor which is chosen to represent, as a function of
$|Y_w(\omega)|$, the probability that speech is in fact present in the
signal y(n). Modification of eq. (14) by this factor is based on the notion
that as the probability that only noise is present increases, it might
perhaps be preferable to further reduce the estimate of $|S_w(\omega)|$.
Other techniques for speech enhancement similar or very closely related to
the various spectral subtraction techniques discussed above include the
work of Curtis and Niederjohn (17) and Preuss (22).
In this section, we have discussed a variety of different techniques
to estimate the short time spectral amplitude of speech. Many of them
can be viewed as attempting to enhance the S/N (Signal to Noise) ratio
by not affecting the spectral components corresponding to relatively
high S/N ratio but attenuating those corresponding to relatively low
S/N ratio. To illustrate this point, consider the spectral subtraction
method corresponding to eq. (13) with the assumption that a=2 and that
the right hand side is positive. Expressing the estimate
$\hat{S}_w(\omega)$ in the form of a zero-phase frequency response
$H(\omega)$ applied to $Y_w(\omega)$,

$$H(\omega) = \frac{\hat{S}_w(\omega)}{Y_w(\omega)} = \left( \frac{|Y_w(\omega)|^2 - k \, E[|D_w(\omega)|^2]}{|Y_w(\omega)|^2} \right)^{1/2} \tag{15}$$

Eq. (15) can be rewritten as

$$H(\omega) = \left( \frac{X^2(\omega) - k}{X^2(\omega)} \right)^{1/2} \tag{16}$$

where

$$X^2(\omega) = |Y_w(\omega)|^2 / E[|D_w(\omega)|^2] \tag{17}$$

From eq. (17), $X(\omega)$ can be interpreted as a signal plus noise to noise
ratio at each frequency $\omega$. In Figure 6 is plotted 20 log $H(\omega)$ for
different values of the constant "k" as a function of 20 log $X(\omega)$. It
is clear from the figure that the frequency components of $Y_w(\omega)$
corresponding to low S/N ratio are severely attenuated. As another example,
a similar plot representing the speech enhancement system corresponding
to eq. (14) derived from maximum likelihood considerations is also shown
in Figure 6. The results in Figure 6 are generally applicable to various
short time spectral amplitude estimation techniques discussed in this
section and will be useful in understanding the results of the performance
evaluation discussed in Section VII.
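
The attenuation curves described above can be tabulated numerically. The sketch below evaluates eq. (16), and the gain implied by eq. (14), as functions of 20 log X. Function names are hypothetical. Two choices are assumptions made here rather than statements from the text: the subtraction gain is taken as zero where $X^2 \le k$, and the radical in eq. (14) is taken as zero where X < 1.

```python
import numpy as np

def subtraction_gain_db(X_db, k=1.0):
    """20 log10 H from eq. (16); minus infinity where X^2 <= k."""
    X2 = 10.0 ** (np.asarray(X_db, dtype=float) / 10.0)   # X^2 from 20 log10 X
    H2 = np.maximum(X2 - k, 0.0) / X2
    with np.errstate(divide="ignore"):
        return 10.0 * np.log10(H2)

def ml_gain_db(X_db):
    """20 log10 H for eq. (14), i.e. H = 1/2 + 1/2 * sqrt(1 - 1/X^2)."""
    X2 = 10.0 ** (np.asarray(X_db, dtype=float) / 10.0)
    # Assumption: the radicand is clamped at zero when X < 1.
    H = 0.5 + 0.5 * np.sqrt(np.maximum(1.0 - 1.0 / X2, 0.0))
    return 20.0 * np.log10(H)

X_db = np.arange(0.0, 25.0, 5.0)
for k in (1.0, 2.0, 4.0):
    print(k, subtraction_gain_db(X_db, k))
print("ML", ml_gain_db(X_db))
```

In this sketch a larger k attenuates low-ratio components more strongly, while the gain of eq. (14) never falls below one half.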

III.2 SPEECH ENHANCEMENT TECHNIQUES BASED ON WIENER FILTERING

In  the  previous  section,  the  basis  for  enhancement  was  the 
explicit  estimation  of  the  short-time  spectral  magnitude  through  a 
process  of  spectral  subtraction.  In  this  section,  we  discuss  techniques 
in  which  a frequency  weighting  for  an  "optimum"  filter  is  first  estimated 
from  the  noisy  speech.  This  filter  is  then  applied  either  in  the  time 
domain  or  frequency  domain  to  obtain  an  estimate  of  the  undegraded 
speech. Thus, with $Y_w(\omega)$, $D_w(\omega)$ and $S_w(\omega)$ again denoting
the short-time spectra associated with the windowed time functions y(n),
d(n), and s(n), the estimate $\hat{S}_w(\omega)$ of $S_w(\omega)$ takes the form

$$\hat{S}_w(\omega) = H(\omega) \, Y_w(\omega) \tag{18}$$

As we saw in eq. (15), the techniques in Section III.1 can also be put
into this form and consequently the essential difference between the
techniques presented in that section and those to be discussed here
rests in the basis on which the frequency weighting $H(\omega)$ is obtained.
In this section we focus on procedures for obtaining $H(\omega)$ based on the
principles of Wiener filtering. However, as we will see toward the end
of this section, an implicit form of this procedure leads, in fact, to
frequency weightings identical to several discussed in Section III.1.
As is well known, for y(n) = s(n) + d(n) in which s(n) and d(n)
represent uncorrelated stationary random processes with power density
spectra $P_s(\omega)$ and $P_d(\omega)$ respectively, the linear estimator of
s(n) which minimizes the mean square error is obtained by filtering y(n)
with the non-causal Wiener filter for which the frequency response is

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)} \tag{19}$$

The non-causal Wiener filter of eq. (19) cannot be applied directly to
estimate s(n) since speech cannot be assumed to be stationary and the
spectrum $P_s(\omega)$ cannot be assumed known. The general approach is to
approximate the non-causal Wiener filter with an adaptive Wiener filter
with frequency response

$$H(\omega) = \frac{E[|S_w(\omega)|^2]}{E[|S_w(\omega)|^2] + E[|D_w(\omega)|^2]} \tag{20}$$

As in Section III.1, the function $E[|D_w(\omega)|^2]$ may be obtained either
from the assumed known statistics of d(n) or by averaging many frames of
$|D_w(\omega)|^2$ during silence intervals in which the statistics of the
background noise can be assumed to be stationary. In estimating
$E[|S_w(\omega)|^2]$, there are a variety of possibilities. Callahan (33)
first estimates $E[|Y_w(\omega)|^2]$ by locally averaging $|Y_w(\omega)|^2$
over many frames of noisy speech. Then $E[|D_w(\omega)|^2]$ is subtracted
from the estimated $E[|Y_w(\omega)|^2]$ to form an estimate of
$E[|S_w(\omega)|^2]$. An equally reasonable method is to first estimate
$E[|Y_w(\omega)|^2]$ by smoothing $|Y_w(\omega)|^2$ rather than averaging
$|Y_w(\omega)|^2$ over many frames of noisy speech and then subtracting
$E[|D_w(\omega)|^2]$ from the estimated $E[|Y_w(\omega)|^2]$. As other
possibilities, $E[|S_w(\omega)|^2]$ may be approximated as
$|\hat{S}_w(\omega)|^2$ or by smoothing $|\hat{S}_w(\omega)|^2$, where
$|\hat{S}_w(\omega)|^2$ is obtained from the short-time spectral amplitude
estimation techniques discussed in Section III.1.

Given $H(\omega)$, the short-time speech segment is then obtained as
specified by eq. (18) applied either in the time domain or in the frequency
domain. It should be noted that in all of the above procedures, the
frequency weighting $H(\omega)$ has zero phase and thus from eq. (18) the
phase associated with the estimate $\hat{S}_w(\omega)$ is that of
$Y_w(\omega)$. Thus just as with the procedures in Section III.1, it is only
the spectral magnitude of $S_w(\omega)$ which is estimated.

Generalizations of Wiener filtering may also be considered. One
such generalization which has been studied extensively (34,35) in the
context of image restoration has the frequency response given by

$$H(\omega) = \left[ \frac{P_s(\omega)}{P_s(\omega) + \alpha P_d(\omega)} \right]^{\beta} \tag{21}$$

for some constants "α" and "β" and has been referred to as parametric
Wiener filters. By varying the constants "α" and "β", filters with
different characteristics can be obtained. For example if α and β are
unity, then eq. (21) corresponds to Wiener filtering as specified in eq.
(19). If α is unity and β is 1/2, then eq. (21) corresponds to power
spectrum filtering (36) which has the characteristic that the enhanced
signal has the same power spectrum $P_s(\omega)$ used in eq. (21). Again, due
to the non-stationarity of speech, eq. (21) has to be modified. The
approximation of $P_s(\omega)$ and $P_d(\omega)$ by the corresponding short-time
energy spectra and generation of speech based on the estimated $H(\omega)$
have already been discussed. With this approximation, the frequency
response associated with short time parametric Wiener filtering would then
be expressed as

$$H(\omega) = \left[ \frac{E[|S_w(\omega)|^2]}{E[|S_w(\omega)|^2] + \alpha E[|D_w(\omega)|^2]} \right]^{\beta} \tag{22}$$

In the Wiener filter of eq. (20) or its generalized form of eq.
(22) it is assumed that the term representing $P_s(\omega)$ or
$E[|S_w(\omega)|^2]$ is first obtained and the frequency weighting is then
applied to $Y_w(\omega)$. An alternative is to treat eq. (20) and eq. (22) as
implicit relationships. For example, let us estimate $E[|S_w(\omega)|^2]$ as

$$E[|S_w(\omega)|^2] = |\hat{S}_w(\omega)|^2 \tag{23}$$

where $\hat{S}_w(\omega)$ is the estimate of the short-time spectrum of the
speech. Then,

$$\hat{S}_w(\omega) = H(\omega) \, Y_w(\omega) = \left[ \frac{|\hat{S}_w(\omega)|^2}{|\hat{S}_w(\omega)|^2 + \alpha E[|D_w(\omega)|^2]} \right]^{\beta} Y_w(\omega) \tag{24}$$

so that

$$|\hat{S}_w(\omega)| = \left[ \frac{|\hat{S}_w(\omega)|^2}{|\hat{S}_w(\omega)|^2 + \alpha E[|D_w(\omega)|^2]} \right]^{\beta} |Y_w(\omega)| \tag{25}$$

This, of course, is an implicit relationship, from which we would like to
obtain $|\hat{S}_w(\omega)|$ and thus we refer to it as implicit Wiener
filtering. For example, two solutions to eq. (25) for β=1/2 are

$$|\hat{S}_w(\omega)| = 0 \tag{26a}$$

$$|\hat{S}_w(\omega)| = \left[ |Y_w(\omega)|^2 - \alpha E[|D_w(\omega)|^2] \right]^{1/2} \tag{26b}$$

Thus, a solution for $|\hat{S}_w(\omega)|$ consistent with eq. (25) is eq. (26b)
for positive values under the radical and zero otherwise. This, of course,
is precisely the spectral subtraction method of eq. (13) with a=2 and k=α.
Similarly, for β=1 a solution to eq. (25) is

$$|\hat{S}_w(\omega)| = \tfrac{1}{2} |Y_w(\omega)| + \tfrac{1}{2} \left[ |Y_w(\omega)|^2 - 4 \alpha E[|D_w(\omega)|^2] \right]^{1/2} \tag{27}$$

For α=1/4 this is identical to the maximum likelihood estimate of eq.
(14).
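
The closed-form solutions can be checked numerically by substituting them back into eq. (25). The sketch below does this for one frequency bin; the function name and the numerical values are arbitrary assumptions chosen so that the radicands are positive.

```python
import numpy as np

def implicit_rhs(S_mag, Y_mag, noise_power, alpha, beta):
    """Right-hand side of eq. (25) for a candidate value of |S_w|."""
    return (S_mag ** 2 / (S_mag ** 2 + alpha * noise_power)) ** beta * Y_mag

# Assumed illustrative values for a single bin.
Y_mag, noise_power, alpha = 3.0, 2.0, 0.25

# beta = 1/2: the non-zero solution, eq. (26b).
s_half = np.sqrt(Y_mag ** 2 - alpha * noise_power)
# beta = 1: eq. (27).
s_one = 0.5 * Y_mag + 0.5 * np.sqrt(Y_mag ** 2 - 4 * alpha * noise_power)

# Each candidate reproduces itself when inserted into eq. (25).
print(np.isclose(implicit_rhs(s_half, Y_mag, noise_power, alpha, 0.5), s_half))
print(np.isclose(implicit_rhs(s_one, Y_mag, noise_power, alpha, 1.0), s_one))
```

For β=1 the check amounts to verifying that the candidate is a root of the quadratic $|\hat{S}_w|^2 - |Y_w| \, |\hat{S}_w| + \alpha E[|D_w|^2] = 0$ obtained by clearing the denominator in eq. (25).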

Another potential generalization of Wiener filtering stems from
considering an iterative approach to estimating $E[|S_w(\omega)|^2]$ in
eq. (22). For example, let us consider an iterative procedure whereby
$|\hat{S}_w(\omega)|_i$ denotes the estimate of $|S_w(\omega)|$ on the ith
iteration with

$$\hat{S}_w(\omega)_{i+1} = H_i(\omega) \, Y_w(\omega) \tag{28}$$

The transfer function $H_i(\omega)$ is in the form of eq. (22) with
$E[|S_w(\omega)|^2]$ estimated from $\hat{S}_w(\omega)_i$. In such iterative
procedures there are, of course, issues of convergence which will in
general depend on the way in which the iteration is started and on
specifically how $E[|S_w(\omega)|^2]$ is estimated from $\hat{S}_w(\omega)_i$.
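
A minimal sketch of such an iteration on spectral magnitudes follows. It makes two assumptions that the text leaves open: $E[|S_w(\omega)|^2]$ is taken as $|\hat{S}_w(\omega)|_i^2$, and the iteration is started from the noisy magnitude $|Y_w(\omega)|$. The function name and iteration count are hypothetical, and `noise_power` is assumed strictly positive.

```python
import numpy as np

def iterative_wiener_magnitude(Y_mag, noise_power, alpha=1.0, beta=1.0,
                               num_iter=50):
    """Iterate eq. (28) on magnitudes with E[|S_w|^2] taken as |S_hat_i|^2."""
    Y_mag = np.asarray(Y_mag, dtype=float)
    # Assumed starting point: the noisy magnitude itself.
    S_mag = Y_mag.copy()
    for _ in range(num_iter):
        # H_i in the form of eq. (22), built from the current estimate.
        H = (S_mag ** 2 / (S_mag ** 2 + alpha * noise_power)) ** beta
        S_mag = H * Y_mag
    return S_mag

# Two bins: one where |Y|^2 exceeds 4 * alpha * E|D|^2 and one where it does not.
print(iterative_wiener_magnitude(np.array([2.0, 1.0]), 0.5))
```

With β=1 and this starting point, the first bin settles at the value given by eq. (27), while the second bin, for which eq. (27) has no real solution, decays toward zero. Other starting points or other ways of estimating $E[|S_w(\omega)|^2]$ may behave differently.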
IV. SPEECH ENHANCEMENT TECHNIQUES BASED ON PERIODICITY OF VOICED SPEECH

In this Section, we discuss speech enhancement techniques which capitalize
on the observation that waveforms of voiced sounds are periodic with a period
that corresponds to the fundamental frequency. Even with this basic
underlying principle many different approaches are possible. In Section IV.1,
we discuss an approach based on comb filtering to pass the harmonics of speech
but reject the frequency components between the harmonics. In Section IV.2,
we consider the extraction of speech harmonics from a high resolution spectrum
of noisy speech. In Section IV.3, we discuss the use of adaptive noise
cancelling techniques to reduce the background noise by capitalizing on the
periodicity of voiced sounds to provide a reference input.

IV.1 SPEECH ENHANCEMENT BASED ON ADAPTIVE COMB FILTERING

The periodicity of a time waveform manifests itself in the frequency
domain as harmonics with the fundamental frequency corresponding to the period
of the time waveform, as is shown in Figure 7. In Figure 7(a) is shown a
segment of a periodic time waveform and in Figure 7(b) is shown the magnitude
spectrum of the time waveform in Figure 7(a). Since the energy of a periodic
signal is concentrated in bands of frequencies as is shown in Figure 7(b) and
the interfering signals in general have energy over the entire frequency
band, to the extent that accurate information on the fundamental frequency is
available, a comb filter as shown in Figure 7(c) can reduce noise while
preserving the signal.

Even  though  voiced  speech  is  only  approximately  periodic,  the  concept  of 
comb  filtering  to  reduce  the  background  noise  in  noisy  speech  may  still  be 
applicable. One approach to enhancing degraded speech through comb filtering
was  taken  by  Shields  (37).  A typical  impulse  response  of  a comb  filter  as 
applied  by  Shields  is  shown  in  Figure  8(a).  The  spacing  "T"  in  the  figure 
represents  the  pitch  period  and  a different  value  of  "T"  is  chosen  in 
processing  different  parts  of  voiced  speech  to  adapt  globally  to  the  time 
varying  nature  of  speech.  Frazier,  et  al  (38),  observed  that  even  with 
accurate  fundamental  frequency  information  Shields'  adaptive  comb  filtering 
technique  distorts  speech  signals  significantly  due  to  the  time  varying  nature 
of  speech  even  on  a short  time  (local)  basis.  To  reduce  some  of  this 
distortion,  Frazier,  et  al  (38)  suggested  a filter  that  adapts  itself  both 
globally  and  locally  to  the  time  varying  nature  of  speech.  A typical  impulse 
response  of  Frazier's  adaptive  filter  is  shown  in  Figure  8(b).  The  spacing  "T^" 
in  Figure  8(b)  is  adapted  to  the  local  variation  of  the  pitch  periods  of 
voiced  speech.  A typical  algorithm  for  speech  enhancement  by  adaptive  comb 
filtering  (or  adaptive  filtering)  is  shown  in  Figure  9. 
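
A fixed-period comb filter of the kind described above, with impulses spaced one pitch period apart, can be sketched as follows. The function name, the three-tap coefficient set and the test signal are illustrative assumptions and are not the filters of Figure 8; the local adaptation of the spacing is not modelled.

```python
import numpy as np

def comb_filter(y, T, coeffs):
    """Filter y with impulses coeffs[k] placed at n = (k - K) * T, K = len(coeffs) // 2.

    T is the pitch period in samples; coeffs has odd length and is centred on n = 0.
    """
    K = len(coeffs) // 2
    h = np.zeros(2 * K * T + 1)
    h[::T] = coeffs                       # impulses spaced T samples apart
    # mode="same" keeps the output aligned with the input (requires len(y) >= len(h)).
    return np.convolve(y, h, mode="same")

# Assumed example: a signal with period T = 20 samples in white noise.
rng = np.random.default_rng(0)
T = 20
n = np.arange(4000)
s = np.sin(2 * np.pi * n / T) + 0.5 * np.cos(2 * np.pi * 3 * n / T)
d = rng.standard_normal(n.size)
coeffs = np.array([0.25, 0.5, 0.25])      # assumed coefficients, summing to one

s_out = comb_filter(s, T, coeffs)         # periodic part passes unchanged
d_out = comb_filter(d, T, coeffs)         # noise power drops by about sum(coeffs**2)
```

Because the coefficients sum to one and the signal repeats every T samples, the periodic component is unchanged away from the edges of the record, while uncorrelated noise is reduced in power by roughly the sum of the squared coefficients.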

[Fig. 9. A typical algorithm for speech enhancement by adaptive comb
filtering or adaptive filtering.]

IV.2 SPEECH ENHANCEMENT BASED ON HARMONIC SELECTION

The adaptive filtering technique discussed in Section IV.1 requires
accurate pitch information and therefore a separate system that estimates the
pitch information is necessary. In the context of an application in which the
interfering background noise is a competing speaker, Parsons (23) developed a
system which is closely related to comb filtering with the pitch information
obtained as an integral part of the system. Voiced speech is windowed and a
high resolution short time spectrum is obtained. In the short time spectrum,
the periodicity of speech exhibits itself as local spectral peaks, some of
which are due to the main speaker and some others of which are due to a
competing speaker. Parsons developed a technique in which each of the local
spectral peaks in the high resolution short time spectrum is attributed
either to the main speaker or to a competing speaker. Then speech is generated
based on the spectral content that corresponds to the peaks of the main
speaker. Since the essence of Parsons' system is location and selection of
speech harmonics of a speaker from the high resolution spectrum of degraded
speech, it can be approximately viewed as a frequency domain implementation
of a pitch information extractor and an adaptive filter.

IV.3 SPEECH ENHANCEMENT BASED ON ADAPTIVE NOISE CANCELLING TECHNIQUES

A class of techniques referred to as adaptive noise cancelling has been
developed which is based on the availability of both the degraded signal
y(n)=s(n)+d(n) and a reference signal r(n) which is uncorrelated with s(n) but
correlated  with  d(n).  A block  diagram  representation  of  such  a system  is 
shown  in  Figure  10.  By  adaptively  filtering  r(n)  an  estimate  of  the  component 
d(n)  that  is  correlated  with  r(n)  is  formed  and  subtracted  from  y(n). 

Adaptive  noise  cancelling  is  applicable  to  processing  of  inputs  whose 
properties  are  unknown,  and  good  performance  can  be  achieved  if  a suitable 
reference  input  is  available.  A detailed  discussion  of  the  principles, 
implementations,  etc.  of  adaptive  noise  cancelling  can  be  found  in  (39). 

As  mentioned  in  the  introduction,  the  discussion  in  this  paper  is 
restricted  to  systems  for  which  the  only  signal  available  is  the  degraded 
speech  and  thus  adaptive  noise  cancelling  as  outlined  above  would  not  be 
applicable. However, Sambur (24) has developed a system which utilizes the
principles of adaptive noise cancelling by generating a reference input,
capitalizing on the periodicity of voiced speech. Specifically, let the
reference input r(n) be given by r(n) = y(n-T), where T represents the pitch
period. To the extent that periodicity is strictly observed,

$$r(n) = s(n-T) + d(n-T) = s(n) + d(n-T) \tag{29}$$

Reversing  the  roles  of  s(n)  and  d(n)  in  Figure  10,  r(n)  can  be  viewed  as 
uncorrelated  with  d(n)  to  the  extent  that  the  correlation  of  d(n)  is  short  and 
the  adaptive  filter  has  a short  impulse  response  relative  to  the  pitch  period 
T.  Since  the  component  s(n)  in  r(n)  is  identical  to  the  s(n)  in  the  primary 
input  y(n),  the  output  of  the  adaptive  filter  in  Figure  10  would  correspond  to 
an  estimate  of  s(n).  The  adaptive  noise  cancelling  technique  proposed  by 
Sambur  is  shown  in  Figure  11(a).  An  alternative  approach  to  Sambur's 
technique is shown in Figure 11(b). In the figure, a reference input r(n) is
specified as

$$r(n) = y(n) - y(n-T) \tag{30}$$

so that, to the extent that periodicity is strictly observed,

$$r(n) = s(n) + d(n) - s(n-T) - d(n-T) = d(n) - d(n-T) \tag{31}$$

Then r(n) is uncorrelated with s(n) but is highly correlated with d(n), thus
satisfying the condition for adaptive noise cancelling.

[Fig. 11. (a) An adaptive noise cancelling technique for speech enhancement
by Sambur (24). (b) Another adaptive noise cancelling technique for speech
enhancement.]
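
The first configuration, with reference input r(n) = y(n-T) and the adaptive filter output taken as the estimate of s(n), can be sketched as below. The LMS weight update is an assumed choice of adaptation rule made for this illustration, and the function name, number of taps and step size `mu` are hypothetical values, not parameters reported for Sambur's system.

```python
import numpy as np

def lms_periodicity_enhancer(y, T, num_taps=8, mu=0.01):
    """Estimate s(n) by adaptively filtering the delayed reference r(n) = y(n - T).

    The LMS update is an assumption of this sketch; mu and num_taps are illustrative.
    """
    y = np.asarray(y, dtype=float)
    w = np.zeros(num_taps)
    s_hat = np.zeros_like(y)
    for n in range(T + num_taps - 1, len(y)):
        # Latest num_taps reference samples r(m) = y(m - T), newest first.
        r_vec = y[n - T - num_taps + 1:n - T + 1][::-1]
        s_hat[n] = w @ r_vec          # adaptive filter output: estimate of s(n)
        e = y[n] - s_hat[n]           # primary input minus filter output
        w += 2 * mu * e * r_vec       # LMS weight update
    return s_hat
```

As a usage example under the same assumptions, a sinusoid with a period of 50 samples in white noise can be passed through `lms_periodicity_enhancer(y, T=50)`; once the weights have adapted, the output is expected to track the periodic component more closely than the noisy input does, provided the noise correlation is short relative to T.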

In  this  section,  we  have  discussed  various  speech  enhancement  techniques 
which  capitalize  on  the  periodicity  of  voiced  speech.  Depending  on  how  the 
periodicity of voiced speech is specifically exploited, different
techniques  have  been  developed.  All  of  them,  however,  have  the  common  feature 
that  they  are  based  only  on  the  periodicity  of  voiced  speech  and  require 
accurate  pitch  information.  Techniques  for  extracting  the  pitch  information 
from  noisy  speech  will  be  discussed  in  Section  VI.  Some  performance 
evaluation  results  and  potential  advantages  and  disadvantages  of  the 
techniques  discussed  in  this  Section  will  be  presented  in  Section  VII. 


V.  SPEECH  ENHANCEMENT  TECHNIQUES  BASED  ON  A SPEECH  MODEL 

A digital model of sampled speech that has been used in a number of
practical applications and has a basis (6,7,8) in the physics of the speech
production system was shown in Figure 1. In the model, the excitation
source  is  either  a quasi-periodic  train  of  pulses  for  voiced  sounds  or 
random  noise  for  unvoiced  sounds.  The  digital  filter  represents  the 
effects  of  the  vocal  tract,  lip  radiation,  and,  for  voiced  sounds,  the 
glottal  source.  Since  the  vocal  tract  changes  in  shape  as  a function  of 
time,  the  digital  filter  in  Figure  1 is  in  general  time  varying.  However, 
over  a short  interval  of  time,  the  digital  filter  may  be  approximated  as 
a linear  time  invariant  system.  Many  systems  which  capitalize  on  the 
underlying  speech  model  discussed  above  have  been  proposed  in  the 
literature  for  speech  enhancement  and  in  this  section  we  discuss  some  of 
those  techniques. 

In  the  speech  enhancement  technique  based  on  an  underlying  speech 
model,  the  parameters  of  the  speech  model  are  first  estimated  and  then 
speech  is  generated  based  on  the  estimated  parameters.  The  parameters 
of  the  model  consist  of  the  source  parameters  (pitch  information)  and 
the  system  parameters  (vocal  tract  information).  The  problem  of  estimating 
the  source  parameters  from  noisy  speech  will  be  discussed  in  Section  VI 
where  we  discuss  techniques  for  bandwidth  compression  of  noisy  speech, 
and  in  this  section  we  consider  techniques  for  estimating  the  system 
parameters.  Given  the  estimated  parameters  of  a speech  model,  speech 
can  be  generated  by  a synthesis  system  based  on  the  same  underlying 
speech  model  or  by  designing  a filter  with  the  estimated  speech  model 
parameters  and  then  filtering  the  noisy  speech.  The  former  approach 
requires both the source and system parameters while the latter approach
generally  requires  only  the  system  parameters  as  will  be  discussed 
later. 

The  techniques  to  estimate  the  system  parameters  of  a speech  model, 
of  course,  depend  on  the  specific  model  assumed.  Even  for  the  same 
speech  model,  however,  there  are  again  a variety  of  different  techniques 
that may be used in estimating the model parameters. In Section V.1, we
discuss speech enhancement techniques based on an all-pole model of the
vocal tract and in Section V.2, techniques based on a pole-zero model of
the vocal tract. In Section V.3, we discuss techniques based on
non-parametric speech models.

V.1 SPEECH ENHANCEMENT TECHNIQUES BASED ON AN ALL-POLE MODEL OF SPEECH

In  an  all  pole  model  of  speech,  the  transfer  function  V(z)  in 
Figure  1 is  modelled  on  a short  time  basis  as  all-pole  of  the  form 

$$V(z) = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (32)$$

where p represents the order of the all-pole model. Thus, on a short-time
basis the speech waveform s(n) is assumed to satisfy a difference
equation of the form

$$s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + u(n) \qquad (33)$$

where u(n) is a pulse train for voiced speech or random noise for
unvoiced speech.
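
As an illustration of equation (33), the following minimal sketch runs
the difference equation directly for the two kinds of excitation. The
second-order coefficients, the pitch period of 50 samples and the
function names are hypothetical choices made for the example and do not
come from any of the systems discussed here.

```python
import random


def synthesize_all_pole(a, u):
    """Run s(n) = sum_k a[k-1] * s(n-k) + u(n), i.e. equation (33), from rest."""
    p = len(a)
    s = []
    for n, u_n in enumerate(u):
        past = sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        s.append(past + u_n)
    return s


def pulse_train(length, period):
    """Voiced excitation: a unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def white_noise(length, seed=0):
    """Unvoiced excitation: zero-mean, unit-variance Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


# Hypothetical model: a single resonance with pole radius 0.95 at angle pi/4.
a_demo = [1.3435, -0.9025]
voiced = synthesize_all_pole(a_demo, pulse_train(200, 50))
unvoiced = synthesize_all_pole(a_demo, white_noise(200))
```

The same routine serves both cases because only the excitation u(n)
changes between voiced and unvoiced speech in this model.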

Equation (33) is sometimes referred to as an autoregressive model
or as a linear prediction model since the current sample s(n) can be
viewed as being predicted from a linear combination of past samples
of s(n) with an error of u(n). For notational convenience, the all-pole
parameters will be denoted in vector form as

$$\mathbf{a} = [a_1, a_2, \ldots, a_p]^T \qquad (34)$$

The problem of estimating $\mathbf{a}$ given a segment of s(n) has been
considered extensively (40,41) in the literature. In the absence of
background noise, many different approaches (40) to estimating
$\mathbf{a}$ lead to solving essentially the same set of linear
equations of the form

$$\mathbf{R}\,\mathbf{a} = \mathbf{r} \qquad (35)$$

where $\mathbf{R}$ is a p×p matrix and $\mathbf{r}$ is a p×1 vector.
Depending on how $\mathbf{R}$ and $\mathbf{r}$ are specifically obtained
from s(n), equation (35) is referred to as either the correlation or
covariance method of linear prediction analysis. The principal
advantages of the correlation method are that $\mathbf{R}$ in equation
(35) is a Toeplitz matrix, so that particularly efficient algorithms
(42) to solve equation (35) exist, and that the resulting all-pole
filter is guaranteed to be stable.
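
The correlation method can be sketched as follows. This is a minimal
illustration under our own assumptions: the Levinson-Durbin recursion is
used as one example of an efficient Toeplitz solver, and the test signal,
Hamming window and frame length are hypothetical. The reflection
coefficients are returned because their magnitudes being less than one
is the usual way to confirm the stability property.

```python
import math
import random


def autocorrelation(x, max_lag):
    """Short-time correlation r(0..max_lag) of an already windowed segment."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag)) for lag in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the Toeplitz system R a = r of equation (35) in O(p^2) operations.

    Returns the predictor coefficients a_1..a_p, the reflection coefficients
    and the final prediction-error energy.
    """
    a = []
    reflection = []
    error = r[0]
    for m in range(1, p + 1):
        acc = r[m] - sum(a[k] * r[m - 1 - k] for k in range(m - 1))
        k_m = acc / error
        # Order update: a_j <- a_j - k_m * a_(m-j), then append a_m = k_m.
        a = [a[k] - k_m * a[m - 2 - k] for k in range(m - 1)] + [k_m]
        reflection.append(k_m)
        error *= 1.0 - k_m * k_m
    return a, reflection, error


# Hypothetical test signal: a second-order all-pole process driven by noise.
rng = random.Random(1)
true_a = [1.3435, -0.9025]
s = [0.0, 0.0]
for _ in range(400):
    s.append(true_a[0] * s[-1] + true_a[1] * s[-2] + rng.gauss(0.0, 1.0))
frame = s[-256:]
window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / 255) for n in range(256)]
windowed = [x * w for x, w in zip(frame, window)]
r = autocorrelation(windowed, 2)
a_hat, reflection, error = levinson_durbin(r, 2)
```
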
The problem of estimating the all-pole parameters from noisy speech is
much more difficult, and different approaches generally lead to
different results. One approach is to simply solve equation (35) for the
all-pole parameters $\mathbf{a}$ where the components of $\mathbf{R}$
and $\mathbf{r}$ are estimated accounting for the presence of noise. In
the correlation method of linear prediction analysis, the components of
$\mathbf{R}$ and $\mathbf{r}$ consist of the first p+1 points of the
correlation of $s_w(n)$, representing s(n) multiplied by a time-limited
window w(n) as introduced in Section III.1. The Fourier transform of the
correlation of $s_w(n)$ is $|S_w(\omega)|^2$, and in Section III we have
discussed various techniques to estimate $|S_w(\omega)|^2$ from the
noisy speech. Then one approach would be to estimate $|S_w(\omega)|^2$
from the noisy speech by one of the techniques discussed in Section III,
form $\mathbf{R}$ and $\mathbf{r}$ from the inverse transform of this
estimate and then solve for $\mathbf{a}$ in equation (35). Techniques to
estimate the all-pole coefficients in this way have been considered by
Magill and Un (28), Kobatake, et al. (43), and Lim (18).
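
The procedure just described can be sketched as below. The sketch
assumes power spectral subtraction with the result floored at zero as
the spectral estimator, which is only one of several possible choices;
the test signal, noise level, window and FFT length are likewise
hypothetical. It is not intended to reproduce any of the cited systems.

```python
import numpy as np


def lpc_from_power_spectrum(power_spectrum, order):
    """Inverse-transform a power spectrum to a correlation and solve R a = r."""
    corr = np.fft.ifft(power_spectrum).real[: order + 1]
    R = np.array([[corr[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, corr[1 : order + 1])


def noise_compensated_lpc(noisy_frame, window, noise_power_spectrum, order):
    """Estimate all-pole coefficients after subtracting an assumed noise spectrum."""
    nfft = len(noise_power_spectrum)
    periodogram = np.abs(np.fft.fft(noisy_frame * window, nfft)) ** 2   # |Y_w|^2
    speech_power = np.maximum(periodogram - noise_power_spectrum, 0.0)  # estimate of |S_w|^2
    return lpc_from_power_spectrum(speech_power, order)


# Hypothetical second-order all-pole signal observed in white noise.
rng = np.random.default_rng(0)
true_a = np.array([1.3435, -0.9025])
s = np.zeros(600)
for n in range(2, 600):
    s[n] = true_a[0] * s[n - 1] + true_a[1] * s[n - 2] + rng.standard_normal()
frame = s[-256:]
noise_std = 2.0
noisy = frame + noise_std * rng.standard_normal(256)
window = np.hamming(256)
nfft = 1024
# For white noise of variance noise_std^2 the expected |D_w|^2 is flat in frequency.
noise_power = np.full(nfft, noise_std**2 * np.sum(window**2))
a_plain = noise_compensated_lpc(noisy, window, np.zeros(nfft), 2)   # no compensation
a_comp = noise_compensated_lpc(noisy, window, noise_power, 2)       # with compensation
```

With a zero noise spectrum the routine reduces to the ordinary
correlation method applied to the noisy frame, which gives a convenient
baseline against which the compensated estimate can be compared.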

A more theoretical approach to the problem of estimating the all-pole
coefficients $\mathbf{a}$ is to use well known parameter estimation
rules. Before we discuss this approach in greater detail, we review very
briefly the general principles of parameter estimation.

Let A and R denote the parameter space and the observation space,
and assume that there is a probabilistic mapping between these spaces
with a point a in the parameter space mapped to a point r in the
observation space. The parameter estimation problem is to estimate the
value of a from the observation r, using some estimation rule. The three
estimation rules known as Maximum Likelihood (ML), Maximum A Posteriori
(MAP) and Minimum Mean Square Error (MMSE) estimation have many
desirable properties and thus have been studied (44,45) extensively in
the literature. For non-random parameters, the ML estimation rule is
often used. In ML estimation, the parameter value is chosen such that
the chosen value most likely resulted in the observation r. Thus, the
value of a is chosen such that $p_{R|A}(r|a)$, the probability density
function of R conditioned on A, is maximized at the observed r and the
chosen value of a. The MAP and MMSE estimation rules are commonly used
for parameters that can be considered as random variables whose a priori
density function is known. In the MAP estimation rule, the parameter
value is chosen such that the a posteriori density $p_{A|R}(a|r)$ is
maximized at the observed r and the chosen value of a. ML and MAP
estimation rules lead to identical estimates of the parameter value when
the a priori density of the parameter in the MAP estimation rule is
assumed to be flat over the parameter space. For this reason, the ML
estimation rule is often viewed as a special case of the MAP estimation
rule. In the MMSE estimation rule, $\hat{a}(R)$, the estimate of a, is
obtained by minimizing the mean square error $E[(\hat{a}(R) - a)^2]$.
The MMSE estimate of a is given by $E[a|r]$, the a posteriori mean of a
given r. Therefore, when the maximum of the a posteriori density
function $p_{A|R}(a|r)$ coincides with its mean, the MAP estimation and
MMSE estimation rules lead to identical estimates.
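
The three rules can be compared on a simple case. The sketch below
assumes a scalar observation r = a + n with Gaussian noise and a
Gaussian a priori density, evaluated numerically on a grid; this toy
problem and all of its numerical values are our own illustrative
assumptions. Because the a posteriori density is then Gaussian, its
maximum and its mean coincide, and widening the a priori density moves
the MAP estimate toward the ML estimate.

```python
import math


def gaussian(x, mean, var):
    """Gaussian density with the given mean and variance, evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimates_on_grid(r, noise_var, prior_mean, prior_var, grid):
    """Return (ML, MAP, MMSE) estimates of a from one observation r = a + n."""
    likelihood = [gaussian(r, a, noise_var) for a in grid]          # p(r|a)
    prior = [gaussian(a, prior_mean, prior_var) for a in grid]      # p(a)
    posterior = [l * p for l, p in zip(likelihood, prior)]          # proportional to p(a|r)
    a_ml = grid[max(range(len(grid)), key=likelihood.__getitem__)]
    a_map = grid[max(range(len(grid)), key=posterior.__getitem__)]
    a_mmse = sum(a * w for a, w in zip(grid, posterior)) / sum(posterior)
    return a_ml, a_map, a_mmse


grid = [i * 0.001 for i in range(-5000, 5001)]
# Informative a priori density: MAP and MMSE agree and differ from ML.
a_ml, a_map, a_mmse = estimates_on_grid(2.0, 1.0, 0.0, 1.0, grid)
# Nearly flat a priori density: MAP approaches ML.
flat_ml, flat_map, _ = estimates_on_grid(2.0, 1.0, 0.0, 1.0e6, grid)
```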

Lim and Oppenheim (29) have considered estimation of the all-pole
coefficients $\mathbf{a}$ using MAP estimation, thus maximizing
$p(\mathbf{a}|\mathbf{y})$ (see footnote 3), where $\mathbf{y}$
represents the samples of noisy speech, with the assumption that the
excitation is white Gaussian noise. The approach was motivated partly
by the fact (29,46) that in the absence of background noise the MAP
estimation procedure with white Gaussian noise excitation leads to the
correlation method of linear prediction analysis, which has been
successful in the analysis of both voiced and unvoiced speech. In the
presence of background noise, the MAP parameter estimation rule leads to
solving a set of non-linear equations (29). However, if $\mathbf{a}$ is
estimated by maximizing $p(\mathbf{a},\mathbf{s}|\mathbf{y})$, where
$\mathbf{s}$ represents the samples of noise-free speech, then an
iterative algorithm which requires solving only sets of linear equations
can be developed. The iterative algorithm, referred to as linearized
MAP (LMAP), begins with an initial estimate $\hat{\mathbf{a}}_0$ and
then estimates $\mathbf{s}$ as
$E[\mathbf{s}|\hat{\mathbf{a}}_0,\mathbf{y}]$. With this estimate of
$\mathbf{s}$, a new estimate $\hat{\mathbf{a}}_1$ is obtained by the
correlation method of linear prediction analysis. With the new
$\hat{\mathbf{a}}_1$, the above procedure is repeated to obtain a newer
estimate $\hat{\mathbf{a}}_2$. It can be shown (29) that estimating
$\mathbf{s}$ as $E[\mathbf{s}|\hat{\mathbf{a}}_i,\mathbf{y}]$ is a
linear problem and further that the above iterative procedure increases
$p(\mathbf{a},\mathbf{s}|\mathbf{y})$ in each iteration.

Footnote 3: For a more accurate representation, a probability density
function $p_x(\cdot)$ and the density function evaluated at $x = x_0$
should be distinguished. For notational convenience, $p(x_0)$ will be
used in both cases and the distinction will be left to the context in
which it is used.

If an infinite amount of data is assumed to be available, it can be
shown that estimating $\mathbf{s}$ as
$E[\mathbf{s}|\hat{\mathbf{a}}_i,\mathbf{y}]$ is equivalent to filtering
the noisy speech with a non-causal Wiener filter whose frequency
response is given by

$$H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_d(\omega)} \qquad (36)$$

with

$$P_s(\omega) = \frac{g^2}{\left|1 - \sum_{k=1}^{p} a_k e^{-j\omega k}\right|^2} \qquad (37)$$

where $a_k$ in equation (37) corresponds to $\hat{\mathbf{a}}_i$ and g
represents the gain in the excitation. A typical LMAP algorithm with the
assumption of an infinite amount of data is shown in Figure 12. As is
clear from the figure, the approach based on maximizing
$p(\mathbf{a},\mathbf{s}|\mathbf{y})$ estimates not only the all-pole
coefficients but also the noise-free speech vector $\mathbf{s}$. Thus,
either $\hat{\mathbf{s}}_i$ can be utilized as the estimate of s(n), or
the coefficients $\hat{\mathbf{a}}$ can be used to synthesize an
estimate of s(n).
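
A minimal frequency-domain sketch of this iteration is given below. It
alternates between the correlation method and the Wiener filter of
equation (36), with the filtering done by multiplying FFTs (and hence
circular and zero-phase). The way the gain is obtained here, from the
prediction residual power, is our own simplifying assumption, as are the
white-noise model, the test signal and the number of iterations. The
sketch should not be read as the procedure of Figure 12.

```python
import numpy as np


def lpc_autocorrelation(x, order):
    """Correlation method: solve R a = r; also return the residual power g^2."""
    n = len(x)
    corr = np.array([np.dot(x[: n - k], x[k:]) for k in range(order + 1)])
    R = np.array([[corr[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, corr[1:])
    g2 = (corr[0] - np.dot(a, corr[1:])) / n       # assumed gain estimate
    return a, g2


def all_pole_power_spectrum(a, g2, nfft):
    """P_s(w) = g^2 / |1 - sum_k a_k exp(-jwk)|^2 sampled on nfft frequencies."""
    A = np.fft.fft(np.concatenate(([1.0], -a)), nfft)
    return g2 / np.abs(A) ** 2


def lmap(y, noise_power, order, iterations=3):
    """Alternate all-pole estimation and non-causal Wiener filtering of y."""
    Y = np.fft.fft(y)
    s_hat = y.copy()
    for _ in range(iterations):
        a_hat, g2 = lpc_autocorrelation(s_hat, order)
        P_s = all_pole_power_spectrum(a_hat, g2, len(y))
        H = P_s / (P_s + noise_power)               # Wiener filter, equation (36)
        s_hat = np.fft.ifft(H * Y).real
    return a_hat, s_hat


# Hypothetical second-order all-pole signal in unit-variance white noise.
rng = np.random.default_rng(0)
true_a = np.array([1.3435, -0.9025])
clean = np.zeros(1024)
for n in range(2, 1024):
    clean[n] = true_a[0] * clean[n - 1] + true_a[1] * clean[n - 2] + rng.standard_normal()
clean = clean[-512:]
noisy = clean + rng.standard_normal(512)
a_hat, s_hat = lmap(noisy, noise_power=1.0, order=2, iterations=3)
```

The routine returns both outputs mentioned in the text: the filtered
waveform, which can be used directly, and the coefficients, which could
instead drive a synthesizer.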

In the LMAP algorithm, when $\mathbf{a}$ is estimated from
$\hat{\mathbf{s}}_i$ by the correlation method of linear prediction
analysis, the values $\hat{\mathbf{s}}_i$ are used to form the
short-time correlation, which consists of components of the form
s(i)s(j). The LMAP algorithm estimates s(i)s(j) by

$$\widehat{s(i)\,s(j)} = E[s(i)\mid\mathbf{a},\mathbf{y}] \cdot E[s(j)\mid\mathbf{a},\mathbf{y}] \qquad (38)$$

As an alternative, s(i)s(j) may be estimated directly by

$$\widehat{s(i)\,s(j)} = E[s(i)\,s(j)\mid\mathbf{a},\mathbf{y}] \qquad (39)$$

An iterative algorithm based on equation (39) has been referred to (29)
as the Revised LMAP (RLMAP) algorithm. It can be shown that estimating
s(i)s(j) using equation (39) again requires solving only a set of
linear equations and, as with the LMAP algorithm, the assumption of
infinite data leads to a computationally simple procedure which has a
frequency domain representation. Furthermore, it can be shown (18,29)
that each iteration in the RLMAP algorithm increases
$p(\mathbf{a}|\mathbf{y})$ instead of
$p(\mathbf{a},\mathbf{s}|\mathbf{y})$, thus corresponding to a true MAP
parameter estimation rule.
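
The relationship between the two correlation estimates can be made
explicit. For any pair of random variables, the conditional second
moment is the product of the conditional means plus the conditional
covariance, so that in the notation of equations (38) and (39)

```latex
E[s(i)\,s(j)\mid\mathbf{a},\mathbf{y}]
  = E[s(i)\mid\mathbf{a},\mathbf{y}]\;E[s(j)\mid\mathbf{a},\mathbf{y}]
  + \operatorname{Cov}[s(i),s(j)\mid\mathbf{a},\mathbf{y}]
```

Viewed this way, the estimate of equation (39) differs from that of
equation (38) only by the conditional covariance term, which the LMAP
estimate omits.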

In the above we have discussed several approaches to estimating the
parameters in an all-pole model of the vocal tract. In the LMAP
algorithm, the noise-free speech is estimated in the process of
estimating the all-pole parameters and thus the estimate of noise-free
speech can be directly used as the output of the enhancement system. In
other approaches, however, speech has to be generated from the estimated
all-pole parameters. One way to generate speech is to use a speech
synthesis system based on the same underlying speech model used in the
analysis. This approach requires an estimate of the source parameters.
An alternative approach which does not require an estimate of the source
parameters is to form $P_s(\omega)$ in equation (37) from the speech
model parameters and then form an optimum filter $H(\omega)$ as in
equation (21). Then speech can be generated by filtering the noisy
speech. If the filtering is performed in the same manner as in Section
III.2, i.e., $H(\omega)$ applied to $Y_w(\omega)$ to obtain the estimate
$\hat{S}_w(\omega)$, the techniques discussed in this section again can
be viewed as a particular method of estimating the short-time spectral
amplitude of speech discussed in Section III. The difference lies in the
fact that the techniques discussed in this section were developed by
attempting to capitalize on a particular speech model.

V.2 SPEECH ENHANCEMENT TECHNIQUES BASED ON A POLE-ZERO MODEL OF SPEECH

Even though the all-pole model of speech has been used in many
speech communication problems, it is known (7,8) that a variety of
sounds can be more adequately modelled by a pole-zero system. In a
pole-zero model of speech, the transfer function V(z) in Figure 1 is
modelled on a short-time basis to be of the form

$$V(z) = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (40)$$

where q represents the order of the zeros. Thus on a short-time basis
the speech waveform s(n) is assumed to satisfy an autoregressive moving
average difference equation of the form

$$s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + \sum_{k=0}^{q} b_k \, u(n-k) \qquad (41)$$

where u(n) represents the source excitation, and $\mathbf{a}$ and
$\mathbf{b}$ are the system parameters of the model. An alternative
representation of equation (41) is

$$x(n) = \sum_{k=1}^{p} a_k \, x(n-k) + u(n) \quad \text{and} \quad s(n) = \sum_{k=0}^{q} b_k \, x(n-k) \qquad (42)$$

This corresponds to the overall system being represented as the cascade
of an all-pole and an all-zero model as indicated in Figure 13.
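
The equivalence of equations (41) and (42) can be checked numerically.
The sketch below implements both forms starting from rest and applies
them to the same excitation. The coefficient values and the excitation
are hypothetical and serve only to exercise the two routines.

```python
def pole_zero_direct(a, b, u):
    """Equation (41): s(n) = sum_k a_k s(n-k) + sum_k b_k u(n-k)."""
    s = []
    for n in range(len(u)):
        ar = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        ma = sum(b[k] * u[n - k] for k in range(len(b)) if n - k >= 0)
        s.append(ar + ma)
    return s


def pole_zero_cascade(a, b, u):
    """Equation (42): an all-pole section producing x(n), then an all-zero section."""
    x = []
    for n in range(len(u)):
        ar = sum(a[k - 1] * x[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        x.append(ar + u[n])
    return [sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0) for n in range(len(x))]


# Hypothetical model with p = 2 poles and q = 2 zeros, driven by a pulse train.
a_demo = [1.3435, -0.9025]
b_demo = [1.0, -0.5, 0.25]
u_demo = [1.0 if n % 25 == 0 else 0.0 for n in range(100)]
s_direct = pole_zero_direct(a_demo, b_demo, u_demo)
s_cascade = pole_zero_cascade(a_demo, b_demo, u_demo)
```

The intermediate sequence x(n) of the cascade form is the quantity that
the generalized iterative algorithms discussed next treat as the unknown
signal.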

In general, estimating the zero parameters $\mathbf{b}$ in the presence
of noise is a very difficult problem since zeroes are much more easily
masked by the background noise than poles. Nevertheless, techniques
similar to those discussed in Section V.1 have been developed to
estimate the zeroes in the presence of noise. One approach is to enhance
speech first by the techniques discussed in Section III and then use
available pole-zero parameter estimation techniques (47,48,49,50)
applicable to noise-free signals.

Another approach to the problem of estimating the model parameters
$\mathbf{a}$ and $\mathbf{b}$ is to use well known parameter estimation
rules. Musicus and Lim (30) and Musicus (31) considered using the MAP
estimation rule and have shown that the iterative algorithms discussed
in Section V.1 for an all-pole model can be generalized to a pole-zero
model. Specifically, the LMAP algorithm can be generalized by attempting
to maximize $p(\mathbf{a},\mathbf{b},\mathbf{x}|\mathbf{y})$ where
$\mathbf{x}$ represents the samples of x(n) in equation (42) and Figure
13. The generalized algorithm begins with an initial estimate
$\hat{\mathbf{a}}_0$ and $\hat{\mathbf{b}}_0$, from which the estimate
$\hat{\mathbf{x}}_0$ of $\mathbf{x}$ is formed as
$\hat{\mathbf{x}}_0 = E[\mathbf{x}|\hat{\mathbf{a}}_0,\hat{\mathbf{b}}_0,\mathbf{y}]$.
With this estimate of $\mathbf{x}$, a new estimate $\hat{\mathbf{a}}_1$
and $\hat{\mathbf{b}}_1$ is obtained as
$\hat{\mathbf{a}}_1, \hat{\mathbf{b}}_1 = E[\mathbf{a},\mathbf{b}|\hat{\mathbf{x}}_0]$.
With the new $\hat{\mathbf{a}}_1$ and $\hat{\mathbf{b}}_1$, the above
procedure is repeated to obtain an updated estimate $\hat{\mathbf{a}}_2$
and $\hat{\mathbf{b}}_2$. It can be shown (31) that the steps discussed
above involve solving only sets of linear equations and further that the
above iterative procedure increases
$p(\mathbf{a},\mathbf{b},\mathbf{x}|\mathbf{y})$ in each iteration.

In the generalized LMAP algorithm discussed above, when $\mathbf{a}$ and
$\mathbf{b}$ are estimated from $\hat{\mathbf{x}}$, the values
$\hat{\mathbf{x}}$ are used to form products of the form x(i)x(j). The
generalized LMAP algorithm estimates x(i)x(j) as

$$\widehat{x(i)\,x(j)} = E[x(i)\mid\mathbf{a},\mathbf{b},\mathbf{y}] \cdot E[x(j)\mid\mathbf{a},\mathbf{b},\mathbf{y}] \qquad (43)$$

As an alternative, x(i)x(j) may be estimated directly as

$$\widehat{x(i)\,x(j)} = E[x(i)\,x(j)\mid\mathbf{a},\mathbf{b},\mathbf{y}] \qquad (44)$$

As with the all-pole case, an iterative algorithm based on equation (44)
increases $p(\mathbf{a},\mathbf{b}|\mathbf{y})$ in each iteration (31).
In both the generalized LMAP and RLMAP algorithms, an infinite data
assumption leads to a computationally simple procedure which has a
frequency domain representation. Generation of speech from the estimated
model parameters is essentially the same as in the all-pole model case
discussed in Section V.1.

V.3 SPEECH ENHANCEMENT TECHNIQUES BASED ON A NON-PARAMETRIC MODEL OF SPEECH

In Sections V.1 and V.2, we have considered speech enhancement
techniques based on a parametric model of the vocal-tract transfer
function V(z). Non-parametric representations for V(z) such as
homomorphic analysis of speech can also be considered (51). For a
non-parametric representation of V(z), it is the impulse response v(n)
which is estimated rather than the model parameters. Two specific speech
enhancement techniques which are based on a non-parametric model of
speech are a system developed by Stockham and Miller (25) to remove
record noise and the orchestral accompaniment from early recordings of
Enrico Caruso, and a system by Suzuki (26,27). The two systems were
briefly discussed in Section II.

A simple alternative approach to capitalize on a non-parametric
representation of speech is to first enhance speech by any of the
techniques discussed in Section III, and then estimate the impulse
response by deconvolution techniques (1,52) based on a non-parametric
representation of speech such as homomorphic speech analysis (51). A
more theoretical approach to estimating the impulse response based on
classical estimation rules is a much more difficult problem. Even though
iterative algorithms similar to those discussed in Sections V.1 and V.2
can in principle be developed, relating the algorithms to an estimation
rule such as MAP estimation is not an easy task.
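
As an illustration of the deconvolution step, the sketch below estimates
an impulse response from a frame by low-time liftering of the real
cepstrum. It assumes that the vocal-tract response is minimum phase,
which is a common simplification and our own assumption here; the
cutoff, the FFT length and the function name are hypothetical.

```python
import numpy as np


def vocal_tract_response(frame, cutoff, nfft=1024):
    """Estimate a minimum-phase impulse response v(n) by cepstral liftering.

    `cutoff` is the number of low-time cepstral samples retained; for voiced
    speech it would be chosen below the pitch period so that the excitation
    is excluded.
    """
    spectrum = np.fft.fft(frame, nfft)
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real              # real cepstrum
    lifter = np.zeros(nfft)
    lifter[0] = 1.0
    lifter[1:cutoff] = 2.0                            # fold low-time part onto n > 0
    min_phase_cepstrum = cepstrum * lifter
    return np.fft.ifft(np.exp(np.fft.fft(min_phase_cepstrum))).real


# Check on a known minimum-phase response excited by a single impulse.
v_true = 0.9 ** np.arange(1024)
v_hat = vocal_tract_response(v_true, cutoff=200)
```

In the approach described above, the frame passed to such a routine
would be the output of one of the enhancement systems of Section III
rather than the noisy speech itself.
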
VI. TECHNIQUES FOR BANDWIDTH COMPRESSION OF NOISY SPEECH

Much  of  the  discussion  in  the  previous  sections  of  this  paper  focused  on 
the  problem  of  processing  degraded  speech  in  preparation  for  listening,  with 
the  objective  of  improving  quality,  intelligibility  or  some  other  attribute. 

A related  but  distinct  problem  is  that  of  processing  degraded  speech  in 
preparation  for  coding  by  a bandwidth  compression  system.  It  is  commonly 
understood  that  robustness  is  a problem  in  bandwidth  compression  of  speech, 
specifically  that  performance  degrades  quickly  (53,54)  as  the  signal  to  noise 
ratio  decreases.  Thus  it  is  important  to  develop  techniques  for  bandwidth 
compression  which  specifically  account  for  the  presence  of  noise. 

There  are  two  basic  approaches  typically  considered.  The  first,  depicted 
in  Figure  14  corresponds  to  using  a conventional  bandwidth  compression  system 
preceded  by  a preprocessor  to  first  reduce  the  background  noise.  In  this  case 
any  of  the  variety  of  noise  reduction  systems  which  were  discussed  previously 
could  potentially  be  used.  A number  of  systems  for  bandwidth  compression  of 
noisy  speech  in  the  form  of  Figure  14  have  been  implemented  and  evaluated. 
Typically,  whereas  the  intelligibility  of  the  output  of  the  noise  reduction 
system  is  less  than  that  of  the  input,  the  intelligibility  of  the  output  of 
the  bandwidth  compression  system  is  higher  than  would  be  achieved  if  the  noise 
reduction system were not present.

An  alternative  approach  is  to  directly  incorporate  into  the  bandwidth 
compression  system  the  knowledge  that  the  model  for  the  input  signal  is  speech 
plus  additive  noise.  For  example,  a number  of  systems  for  compression  of 
undegraded speech are based on parametric modelling of the speech waveform.
The parameters are coded and transmitted and at the receiver are then used to
resynthesize the speech. One particularly successful form for such a system,
referred to as linear predictive coding (LPC), represents the speech signal in
the form of Figure 1 with the vocal-tract transfer function modelled on a
short-time basis as an all-pole filter. As was discussed in Section V.1,
a variety of successful approaches are available for estimating the
parameters of the vocal-tract transfer function. The remaining parameters are
those  used  to  represent  the  excitation  function  and  correspond  to  a decision 
as  to  whether,  for  each  segment,  the  speech  is  voiced  or  unvoiced,  and  if 
voiced  a determination  of  the  fundamental  frequency.  Again,  for  the  case  of 
undegraded  speech,  there  are  a variety  of  successful  systems  for  estimating 
the  excitation  parameters  (55,56,57,58). 

For speech degraded by additive background noise we can, in a similar
fashion, attempt to estimate the parameters. In particular, in Section V.1 we
discussed for degraded speech the estimation of the parameters in an all-pole
model and in Section V.2 the estimation of the parameters in a pole-zero model
using MAP parameter estimation techniques. In the context of that discussion
the parametric modelling was directed at an enhancement system. Clearly,
however, the parameters can be coded with the speech resynthesized at the
receiver, just as is done with conventional LPC. In addition to the resulting
bandwidth compression, the system also performs as a speech enhancement
system.

Another example of a speech compression system which has been modified to
account for the presence of additive noise is the Spectral Envelope Estimation
Vocoder developed by Paul (59). In his speech compression system, the vocal
tract transfer function is estimated by first carrying out a high resolution
spectral analysis for each speech frame. The peaks corresponding to the
spectral envelope at the frequencies of the harmonics of the fundamental
frequency are then located. Next, the spectrum is interpolated between these
frequencies to obtain an estimate of the spectral envelope, corresponding to
the vocal tract transfer function. In the modification of the system when
background noise is present, the assumed spectrum for the background noise is
subtracted from the spectral envelope obtained for the degraded speech. This
new estimate for the vocal tract transfer function is then used to provide the
parameters for the synthesizer.
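
The steps described above can be sketched as follows. The sketch assumes
that the fundamental frequency is already known, that linear
interpolation is used between the harmonic peaks, and that the noise is
subtracted in the power domain with the result floored at zero. These
are our own simplifications, and the function names and test spectrum
are hypothetical. The details of the vocoder in (59) may well differ.

```python
import numpy as np


def harmonic_envelope(power_spectrum, f0_bin, search=2):
    """Pick the largest bin near each harmonic of f0 and interpolate between picks.

    `f0_bin` is the fundamental frequency in units of spectral bins and should
    exceed 2 * search so that neighbouring search ranges do not overlap.
    """
    n_bins = len(power_spectrum)
    peak_bins, peak_vals = [], []
    h = 1
    while h * f0_bin < n_bins:
        center = int(round(h * f0_bin))
        lo, hi = max(center - search, 0), min(center + search + 1, n_bins)
        k = lo + int(np.argmax(power_spectrum[lo:hi]))
        peak_bins.append(k)
        peak_vals.append(power_spectrum[k])
        h += 1
    return np.interp(np.arange(n_bins), peak_bins, peak_vals)


def noise_corrected_envelope(power_spectrum, f0_bin, noise_power_spectrum):
    """Subtract the assumed noise spectrum from the interpolated envelope."""
    envelope = harmonic_envelope(power_spectrum, f0_bin)
    return np.maximum(envelope - noise_power_spectrum, 0.0)


# Hypothetical spectrum: harmonics every 16 bins on a flat noise floor of 1.0.
n_bins = 257
bins = np.arange(n_bins)
true_env = 100.0 * np.exp(-bins / 80.0)
power = np.ones(n_bins)
harmonics = np.arange(16, n_bins, 16)
power[harmonics] += true_env[harmonics]
corrected = noise_corrected_envelope(power, 16.0, np.ones(n_bins))
```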

The above approaches provide several alternatives for obtaining parameters
representing the vocal tract transfer function. In general it appears to be
considerably more difficult to extract excitation parameters from degraded
speech. Essentially all algorithms for determination of excitation parameters
with undegraded speech become seriously degraded at even moderate signal to
noise ratios, and to a large extent the estimation of excitation parameters
from noisy speech remains a current area of research. Particularly difficult
and unresolved is the determination of whether a given segment of speech is
voiced, unvoiced or silence. McAulay (60,61) has proposed one system for
optimum speech classification based on the principles of decision theory. The
resulting system is shown in Figure 15. This system requires an estimate of
the fundamental frequency under the hypothesis that the speech is voiced. For
voiced speech, one approach for determination of the fundamental frequency
that has been particularly successful is the maximum likelihood pitch
estimator as proposed by Wise, Caprio and Parks (62). They formulated the
problem as that of estimating an unknown periodic signal in white Gaussian
noise of unknown intensity. The resulting procedure for obtaining the optimum
estimate of the pitch period corresponds to constructing a bank of comb
filters each tuned to a slightly different pitch period and choosing as the
estimate the pitch corresponding to the comb filter for which the output
energy is largest.

Fig. 15. A system for classification of noisy speech by McAulay (60).
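
The comb filter interpretation can be sketched as below. For each
candidate period the frame is averaged across periods, which is the
least-squares fit of a periodic signal, and the energy of that fit is
taken as the comb filter output energy. This is a simplified
illustration under our own assumptions: the search is restricted to a
range narrower than one octave on either side of the true period, so
that multiples of the period are excluded, and the test signal is
hypothetical. It is not a reproduction of the estimator in (62).

```python
import math
import random


def comb_energy(x, period):
    """Energy of the best-fitting periodic signal with the given period."""
    n_periods = len(x) // period
    template = [
        sum(x[k * period + m] for k in range(n_periods)) / n_periods
        for m in range(period)
    ]
    return n_periods * sum(v * v for v in template)


def ml_pitch_period(x, min_period, max_period):
    """Choose the candidate period whose comb filter output energy is largest."""
    return max(range(min_period, max_period + 1), key=lambda p: comb_energy(x, p))


# Hypothetical voiced-like signal with a period of 40 samples in white noise.
rng = random.Random(0)
true_period = 40
x = [
    math.sin(2.0 * math.pi * n / true_period)
    + 0.5 * math.sin(4.0 * math.pi * n / true_period)
    + rng.gauss(0.0, 0.3)
    for n in range(400)
]
estimate = ml_pitch_period(x, 20, 60)
```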

Another,  somewhat  different  approach  to  obtaining  an  excitation  for  the 
synthesizer  has  been  proposed  by  Magill  and  Un  (28).  This  overall  system  for 
noise  reduction  is  based  on  the  use  of  all-pole  modelling  of  the  vocal  tract 
transfer  function  as  outlined  in  Section  V.  Rather  than  explicitly  estimating 
the  excitation  parameters  they  utilize  the  concepts  of  residual  or  voice 
excited  synthesis.  Specifically,  the  technique  attempts  to  capitalize  on  the 
fact  that  because  of  the  overall  characteristics  of  the  speech  spectrum,  the 
S/N  ratio  of  voiced  speech  in  the  low  frequency  region  is  much  higher  than  in 
the  high  frequency  region.  Thus  in  this  technique,  an  excitation  function  is 
obtained  by  low-pass  filtering  the  residual  signal  of  noisy  speech  and  then 
passing  the  resulting  signal  through  a non-linear  distortion  to  broaden  its 
bandwidth.  While  such  a procedure  in  the  context  of  bandwidth  compression 
requires  a considerably  higher  data  rate  for  encoding  of  the  excitation  than  a 
system  in  which  excitation  parameters  are  explicitly  estimated,  it  appears  to 
be  a reasonably  successful  approach  for  speech  compression  at  data  rates  above 
9600  bits  per  second. 
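
A rough sketch of this way of deriving an excitation is given below. The
residual is obtained by inverse filtering with the all-pole
coefficients, low-pass filtered, and then passed through a non-linearity.
The brick-wall FFT low-pass filter and the full-wave rectifier are our
own stand-ins, since the specific filter and distortion used in (28) are
not described here; the coefficients and test signal are also
hypothetical.

```python
import numpy as np


def lpc_residual(y, a):
    """Inverse-filter y with A(z) = 1 - sum_k a_k z^-k."""
    e = y.astype(float).copy()
    for k, a_k in enumerate(a, start=1):
        e[k:] -= a_k * y[:-k]
    return e


def low_pass(x, cutoff_bin):
    """Brick-wall low-pass filter: zero all FFT bins at or above cutoff_bin."""
    X = np.fft.rfft(x)
    X[cutoff_bin:] = 0.0
    return np.fft.irfft(X, len(x))


def broadened_excitation(y, a, cutoff_bin):
    """Low-pass the residual, then rectify it to regenerate high frequencies."""
    baseband = low_pass(lpc_residual(y, a), cutoff_bin)
    distorted = np.abs(baseband)          # assumed non-linearity
    return distorted - distorted.mean()


# Hypothetical voiced frame: a pulse train driving a second-order all-pole filter.
u = np.zeros(400)
u[::50] = 1.0
a = np.array([1.3435, -0.9025])
y = np.zeros(400)
for n in range(400):
    y[n] = u[n] + sum(a[k] * y[n - 1 - k] for k in range(2) if n - 1 - k >= 0)
excitation = broadened_excitation(y, a, cutoff_bin=25)
```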


VII.  PERFORMANCE  EVALUATION 


The  performance  evaluation  of  the  various  systems  discussed  in  this  paper 
is  a very  difficult  task,  partly  because  the  performance  of  a system  may  vary 
depending  on  the  particular  application  under  consideration.  Some  systems 
which  improve  speech  quality  may  decrease  speech  intelligibility.  Some 
systems  which  improve  speech  intelligibility  in  the  context  of  bandwidth 
compression  may  decrease  speech  intelligibility  in  the  context  of  speech 
enhancement.  Some  systems  which  improve  speech  quality  when  the  speech 
degradation  is  due  to  additive  random  noise  may  not  even  be  applicable  if  the 
degradation  is  due  to  a competing  speaker. 

A further  complicating  factor  in  evaluating  the  system  performance  is  that 
the  objective  of  various  systems  discussed  in  this  paper  is  generally  an 
improvement  in  some  aspects  of  human  perception  such  as  an  improvement  in 
speech  intelligibility  or  quality,  or  reduction  of  listener  fatigue.  Since 
the  human  perceptual  domain  is  not  well  understood,  a careful  system 
evaluation  requires  a subjective  test  such  as  a speech  intelligibility  or 
quality  test.  A careful  subjective  test  can  be  tedious  and  time  consuming, 
and  generally  requires  processing  a large  amount  of  data. 

Because  of  the  difficulty  involved  in  the  evaluation,  only  a few  systems 
have  been  carefully  evaluated  by  a subjective  test  for  some  particular 
environments.  A few  others  have  only  been  evaluated  based  on  an  objective 
measure  such  as  speech  to  noise  (S/N)  ratio  improvement  even  though  such  an 
objective  measure  does  not  correlate  well  with  a subjective  measure.  In  this 
section, we summarize the performance evaluation that has been reported for
some of the systems presented in this paper. Since the evaluation has been
based on different procedures, test material, environments, etc., no attempt
is made to compare individual systems. In Section VII.1, the evaluation of
high-pass filtering and clipping for speech enhancement is summarized. It has
been reported that this system noticeably improves intelligibility despite the
fact that speech quality is seriously degraded. In Section VII.2, the
evaluation of high-pass filtering for the specific phoneme /s/ and creating
short pauses before plosive sounds for speech enhancement is summarized.
It is reported that this system noticeably improves speech intelligibility if
the locations of the phoneme /s/ and the plosive sounds are accurately known.
In Section VII.3, the evaluation of one of the spectral subtraction techniques
is summarized. In the context of speech enhancement the system does not
improve speech intelligibility but improves speech quality. In the context of
bandwidth compression, the system appears to improve intelligibility. In
Section VII.4, the evaluation of adaptive comb filtering for speech
enhancement is summarized. Here again, despite an improvement in S/N ratio,
the system reduces intelligibility. In Section VII.5, the evaluation of
Splicing of Autocorrelation function (SPAC), indicating an improvement in
speech quality, is summarized. In Section VII.6, the evaluation of the LMAP
and RLMAP techniques is summarized. The LMAP technique appears to improve
speech quality both in the context of speech enhancement and bandwidth
compression. Based on an objective measure, the LMAP and RLMAP techniques
estimate the speech synthesis parameters more accurately in the context of
bandwidth compression.

VII.1 HIGH-PASS FILTERING AND CLIPPING

As  was  discussed  in  Section  II,  high-pass  filtering  and  clipping  have  been 
considered  for  speech  enhancement  by  Thomas  and  Ravindran  (13).  Their 
evaluation  was  based  on  a speech  intelligibility  test  with  the  test  material 
of  Harvard  PB-50  (Phonetically  Balanced)  word  lists  when  the  degradation  is 
wide-band  random  noise.  They  also  evaluated  high-pass  filtering,  clipping  and 
differentiation  for  speech  enhancement.  The  results  of  their  evaluation  are 
shown  in  Figure  16. 

Before  we  discuss  the  results  of  the  evaluation,  we  review  very  briefly 
speech  intelligibility  tests.  In  a typical  speech  intelligibility  test 
(63,64),  listeners  are  presented  with  test  material  and  asked  to  identify  the 
test  material  or  answer  questions  based  on  the  test  material.  For  example, 
listeners  may  be  presented  with  sentences,  words  or  syllables  and  asked  to  write 
the  test  material  that  they  heard  or  choose  one  out  of  several  options  which 
most  closely  resembles  what  they  heard.  Alternatively,  subjects  may  be 
presented  with  a paragraph  and  asked  to  answer  questions  based  on  the  contents 
of  the  presented  paragraph.  From  the  responses  of  the  listeners  the 
intelligibility  score,  the  percentage  of  "correct"  answers  based  on  some 
predetermined  criterion,  is  computed.  For  a given  type  of  degradation,  the 
intelligibility  score  is  generally  obtained  for  several  different  levels 
(amounts)  of  degradation.  The  amount  of  degradation  is  represented  in  terms 
of  speech  to  noise  (S/N)  ratio.  For  the  same  type  and  level  of  degradation, 
the  intelligibility  score  can  vary  considerably  depending  on  the  test 
procedure, test material, training of subjects, etc. Furthermore, the
definition of S/N ratio employed varies from one evaluation to another.
Therefore, two systems evaluated differently and possibly with a different
definition of S/N ratio cannot be compared based on the intelligibility scores
alone. However, it is generally established that if one system is superior to
another when evaluated by the same test, a similar result also holds when
evaluated by different types of intelligibility tests.

Fig. 16. Intelligibility scores of high-pass filtering and clipping, and
high-pass filtering, clipping and differentiation for enhancement of speech
degraded by wide-band random noise. After Thomas and Ravindran (13).
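
The scoring described above can be sketched in Python as follows. The
exact-match criterion for a "correct" answer and the definition of S/N ratio
as the ratio of average speech power to average noise power are assumptions
chosen for illustration; as noted above, both vary from one evaluation to
another. The function names are hypothetical.

```python
import math


def intelligibility_score(responses, answers):
    """Percentage of responses judged correct (here: exact match, assumed)."""
    correct = sum(1 for r, a in zip(responses, answers) if r == a)
    return 100.0 * correct / len(answers)


def snr_db(speech, noise):
    """S/N ratio in dB as average speech power over average noise power.

    This is only one possible definition of S/N ratio.
    """
    speech_power = sum(v * v for v in speech) / len(speech)
    noise_power = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(speech_power / noise_power)


# Hypothetical usage: three of four test words identified correctly.
score = intelligibility_score(["cat", "dog", "bat", "sun"],
                              ["cat", "dog", "pat", "sun"])
```

Because both the criterion and the S/N definition are free choices, scores
computed this way are comparable only within a single evaluation.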

The  results  in  Figure  16  imply  that  the  two  systems  studied  by  Thomas  and 
Ravindran  (13)  noticeably  improve  speech  intelligibility  when  the  degradation 
is  due  to  additive  wide-band  random  noise.  Since  the  results  reported  are 
somewhat  unexpected  and  the  implication  of  the  results  is  quite  important,  we 
feel  that  a more  extensive  intelligibility  test  should  be  performed  to  verify 
the  results.  Even  though  the  intelligibility  may  be  improved,  the  clipping 
operation  significantly  distorts  speech  and  the  quality  of  speech  would  be 
noticeably degraded. If the possible improvement in intelligibility is
primarily due to high-pass filtering as discussed in Section II, then the
clipping operation would be unnecessary and the degradation of speech quality
would not be a problem.

VII.2 SYSTEM BY DRUCKER


As  was  discussed  in  Section  II,  for  speech  enhancement  Drucker  (14) 
considered  high-pass  filtering  the  phoneme  /s/  and  introducing  a short  pause 
before  plosive  sounds  such  as  /p/,  /t/  and  /k/  assuming  their  accurate 
locations are known. He evaluated the system based on an intelligibility test
when the degradation is wide-band random noise. The results in Figure 17
show that the system improves the intelligibility considerably. However, the
system assumes that the phoneme /s/ and the plosive sounds can be accurately
located.

Fig. 17. Intelligibility scores of Drucker's system (14) for enhancement of
speech degraded by wide-band random noise. After Drucker (14).

VII.3 SPECTRAL SUBTRACTION

The spectral subtraction technique for speech enhancement shown in Figure 5
was evaluated by Lim (19) using nonsense sentences as test material when the
degradation is wide-band random noise. When the parameter "a" equals two, the
system corresponds to the power spectrum (or correlation) subtraction
technique. When "a" equals one, the system corresponds to the speech
enhancement system considered by Boll (20). The results of the test are shown
in Figure 18 as a function of S/N ratio and the constant "a". The results of
the test show that intelligibility is not improved at the S/N ratios at which
the intelligibility scores of unprocessed nonsense sentences range between 20
and 70%. However, processed speech with a = 1 or 0.5 sounds (19) distinctly
"less noisy" and of "higher quality" at relatively high S/N ratios.
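
A minimal Python sketch of spectral subtraction with the exponent "a" is given
below for a single frame. It assumes the rule |S|^a = |Y|^a - E[|D|^a], with
negative differences set to zero and the phase of the noisy speech retained;
these details, the function name and the argument layout are assumptions made
for illustration, not a reproduction of the system in Figure 5.

```python
import cmath


def spectral_subtract(noisy_spectrum, noise_level, a=2.0):
    """Subtract an estimate of E[|D|^a] from |Y|^a in each frequency bin.

    noisy_spectrum: complex spectrum values Y of one frame of noisy speech.
    noise_level:    estimate of E[|D|^a] for each bin (assumed to be known).
    a:              exponent; a = 2 gives power spectrum subtraction and
                    a = 1 gives magnitude subtraction.
    """
    enhanced = []
    for y, d in zip(noisy_spectrum, noise_level):
        difference = abs(y) ** a - d
        # Negative differences are set to zero (an assumed convention).
        magnitude = max(difference, 0.0) ** (1.0 / a)
        # The phase of the noisy speech is kept unchanged.
        enhanced.append(cmath.rect(magnitude, cmath.phase(y)))
    return enhanced


# Hypothetical usage: a strong bin is attenuated, a weak bin is zeroed.
frame = spectral_subtract([3 + 4j, 1 + 0j], [9.0, 4.0], a=2.0)
```

In this sketch the bin whose power exceeds the noise estimate keeps most of its
amplitude, while the bin below the noise estimate is removed entirely.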

The "quality" of speech is much more difficult to quantify than the
"intelligibility" of speech and its meaning can vary considerably among
different listeners. Consequently, there is a greater diversity in the
techniques employed to measure speech quality than intelligibility. In
presenting results on speech quality in this paper, no attempt will be made to
discuss various different speech quality tests employed. Discussions on
speech quality tests can be found in (65,66,67).

The evaluation results discussed above may be partially explained by
considering Figure 6. From the figure, when the background noise is wide-band
random noise, most spectral subtraction techniques emphasize large amplitude
spectral components of noisy speech relative to those with smaller amplitudes.
Since unvoiced speech or higher formants of voiced speech generally have lower
energy relative to lower formants of voiced speech, spectral subtraction in a
white noise environment has the effect of emphasizing lower formants of voiced
speech while deemphasizing unvoiced speech or higher formants of voiced
speech. Such an operation improves the S/N ratio but may in fact decrease
speech intelligibility, which is the result observed by Lim (19).
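
The emphasis of large amplitude components can be made explicit by writing
spectral subtraction as a frequency dependent gain. The expression below is a
sketch that assumes the subtraction rule |S|^a = |Y|^a - N with a flat noise
level N = E[|D|^a] and with negative differences set to zero; the symbols G
and N are introduced here for illustration only.

```latex
\[
  G(\omega) =
  \begin{cases}
    \left( 1 - \dfrac{N}{|Y(\omega)|^{a}} \right)^{1/a}, & |Y(\omega)|^{a} > N, \\[2ex]
    0, & \text{otherwise,}
  \end{cases}
  \qquad |\hat{S}(\omega)| = G(\omega)\,|Y(\omega)| .
\]
% Example with a = 2 and N = 1:
%   |Y|^2 = 10   gives  G = sqrt(0.9), about 0.95
%   |Y|^2 = 2    gives  G = sqrt(0.5), about 0.71
%   |Y|^2 <= 1   gives  G = 0
```

Under these assumptions the gain increases with |Y|, so strong components
such as the lower formants of voiced speech pass almost unchanged, while weak
components near the noise level are heavily attenuated or removed.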

The same spectral subtraction technique with a = 1 was evaluated by Boll (20)
when the degradation is due to helicopter noise. The results based on a
Diagnostic Rhyme Test (68) indicate that at the S/N ratio at which the
intelligibility score of unprocessed speech material is about 84%, the system
does not improve intelligibility but improves quality, which is consistent
with the results by Lim. When the system is used as a pre-processor for a
bandwidth compression system, some improvement in intelligibility was reported
over the bandwidth compression system without a pre-processor. A similar
result has also been reported by Preuss (22) for a slightly different spectral
subtraction technique when the background noise is airborne command post
noise.

VII.4 ADAPTIVE COMB FILTERING AND ADAPTIVE FILTERING

The adaptive filtering technique by Frazier, et al (38) discussed in
Section IV was evaluated by Perlmutter, et al (69) using nonsense sentences as
test material when the degradation is due to a competing speaker. The pitch
information used in the adaptive filtering was obtained from noise-free
speech. The results of the test are shown in Figure 19. Their results
indicate that even with accurate pitch information, the adaptive filtering
technique decreases intelligibility at the S/N ratios at which the
intelligibility of unprocessed nonsense sentences ranges between 20 and 70%.

Fig. 19. Intelligibility scores of Frazier's filtering technique for
enhancement of speech degraded by a competing speaker. After Perlmutter,
et al (69).

Frazier's adaptive filtering technique with the improvement made by Lim,
et al (70) was evaluated using nonsense sentences as test material when the
degradation is due to wide-band random noise (70). The pitch information used
in the processing was obtained from noise-free speech. The results of the
test are shown in Figure 20. Again, the results show that even with accurate
pitch information, the adaptive filtering technique tends to decrease the
intelligibility at various S/N ratios. Since in practice accurate pitch
information is not available and cannot be expected to be obtained from
degraded speech, the intelligibility scores will be even lower than shown in
Figures 19 and 20.

To  the  extent  that  voiced  speech  is  periodic,  the  S/N  ratio  improvement 
for  voiced  speech  using  Frazier's  adaptive  filtering  can  be  analytically 
calculated.  For  the  adaptive  filters  evaluated  by  Lim,  et  al,  the  S/N  ratio 
increase  is  3.5  dB,  7 dB  and  10  dB  corresponding  to  the  filter  lengths  of  3,  7 
and 13 pitch periods. It is interesting to note that a higher S/N ratio
increase leads to a lower intelligibility score. This is partly due to the
fact  that  voiced  speech  is  not  strictly  periodic  and  the  periodicity 
assumption  is  more  seriously  violated  by  a filter  with  a longer  impulse 
response  thus  causing  a higher  signal  distortion.  Despite  the  decrease  in 
speech  intelligibility,  speech  processed  by  an  adaptive  filter  sounds  "less 
noisy"  due  to  the  capability  of  the  system  to  increase  the  S/N  ratio. 

Fig. 20. Intelligibility scores of Frazier's adaptive filtering technique
improved by Lim, et al (70) for enhancement of speech degraded by wide-band
random noise. After Lim, et al (70).
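
The analytical S/N ratio improvement for periodic voiced speech mentioned
above can be sketched as follows. For a comb filter that forms a weighted
average of samples spaced one pitch period apart, a strictly periodic signal
passes unchanged while white noise power is scaled by the sum of the squared
normalized weights, so the S/N gain is (sum of weights)^2 / (sum of squared
weights). The fixed, known pitch period, the uniform weights in the usage
lines and the function names are assumptions; the weights used by Lim, et al
are not given here, so the quoted values of 3.5, 7 and 10 dB are not
reproduced by this sketch.

```python
import math


def comb_filter(x, period, weights):
    """Weighted average of samples spaced one pitch period apart.

    The filter is centered on the current sample and spans len(weights)
    pitch periods. Terms that fall outside the signal are dropped, so the
    first and last few periods of the output are attenuated.
    """
    half = len(weights) // 2
    total = sum(weights)
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            m = n + (k - half) * period
            if 0 <= m < len(x):
                acc += w * x[m]
        y.append(acc / total)
    return y


def snr_gain_db(weights):
    """S/N gain in dB for a strictly periodic signal in white noise."""
    total = sum(weights)
    return 10.0 * math.log10(total * total / sum(w * w for w in weights))


# Hypothetical usage with uniform weights over 3 pitch periods.
periodic = [0.0, 1.0, 2.0, 3.0] * 5
filtered = comb_filter(periodic, period=4, weights=[1.0, 1.0, 1.0])
gain = snr_gain_db([1.0, 1.0, 1.0])
```

With uniform weights over N periods the gain reduces to 10 log10(N) dB, which
shows why a longer filter raises the S/N ratio even though, as noted above, it
also violates the periodicity assumption more seriously.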


VII.5 SPAC

As  was  discussed  in  Section  II,  a speech  enhancement  system  based  on 
splicing  of  autocorrelation  function  (SPAC)  was  developed  by  Suzuki  (26).  The 
system  was  evaluated  by  Nakatsui  (71)  based  on  a speech  quality  test  when  the 
degradation  is  due  to  wide-band  random  noise.  The  results  of  the  test  show 
that  above  5 dB  of  S/N  ratio,  SPAC  does  not  improve  speech  quality.  In  fact, 
at  high  S/N  ratios,  SPAC  is  expected  to  decrease  speech  quality  since  SPAC 
replaces one period of speech with a corresponding period of short time
autocorrelation function thus causing some speech distortion. Below about 5 dB of
S/N  ratio,  however,  some  improvement  in  speech  quality  by  SPAC  was  reported. 

VII.6 LMAP AND RLMAP

The  LMAP  technique  discussed  in  Section  V was  evaluated  by  Lim  (18)  based 
on  a speech  quality  test  using  sentences  as  test  material  when  the  degradation 
is  due  to  wide-band  random  noise.  The  results  of  the  test  indicate  that  the 
LMAP  technique  improves  speech  quality  at  various  S/N  ratios  both  in  the 
context  of  speech  enhancement  and  bandwidth  compression  of  noisy  speech. 

Both  the  LMAP  and  RLMAP  algorithms  were  evaluated  by  Lim  (18)  based  on  an 
objective  measure  in  the  context  of  bandwidth  compression  of  noisy  speech.  In 
the  evaluation,  a number  of  sequences  of  noisy  synthetic  data  were  generated 
by  exciting  known  all  pole  filters  with  white  Gaussian  noise  or  a train  of 
pulses  and  then  adding  wide-band  random  noise  at  various  S/N  ratios.  From  the 
noisy  synthetic  data,  all  pole  coefficients  were  estimated  by  the  correlation 
method  of  linear  prediction  analysis,  the  LMAP  and  RLMAP  algorithms  discussed 
in Section V. The estimated all pole coefficients were then compared with the
known all pole coefficients using the error measure E defined by equation
(45), where "k" is a constant and "a_i" and "â_i" represent the known and
estimated all pole coefficients. The error measure E has some
correlation with perceptually important aspects of speech (40). In Figure
21(a)  is  shown  the  error  E averaged  over  many  different  sets  of  all  pole 
coefficients  when  the  excitation  is  white  Gaussian  noise.  In  Figure  21(b)  is 
shown  the  averaged  error  E when  the  excitation  is  a train  of  pulses.  The 
results  in  Figure  21  indicate  that  based  on  the  objective  measure  given  by 
equation  (45),  the  LMAP  or  RLMAP  algorithm  estimates  the  all  pole  coefficients 
more  accurately  than  the  correlation  method  of  linear  prediction  at  various 
S/N  ratios  when  the  background  noise  is  wide-band  random  noise. 
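
The synthetic-data experiment described above can be sketched in Python for
the correlation method only. The sketch generates data from a known all pole
filter, adds white Gaussian noise at a chosen S/N ratio, and estimates the
coefficients with the Levinson-Durbin recursion. The LMAP and RLMAP
algorithms are not implemented, and `coefficient_error` is a simple squared
difference used as a stand-in; it is not the error measure E of equation
(45). All function names, the filter coefficients and the S/N definition are
assumptions.

```python
import math
import random


def synthesize_all_pole(a, excitation):
    """s[n] = sum_k a[k-1] * s[n-k] + excitation[n] (known all pole filter)."""
    s = []
    for n, u in enumerate(excitation):
        acc = u
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s


def pulse_train(length, period):
    """Unit pulses spaced `period` samples apart."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def add_noise(s, snr_db, rng):
    """Add white Gaussian noise at the given S/N ratio (average power ratio)."""
    signal_power = sum(v * v for v in s) / len(s)
    noise_std = math.sqrt(signal_power / 10.0 ** (snr_db / 10.0))
    return [v + rng.gauss(0.0, noise_std) for v in s]


def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations of the correlation method."""
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        refl = acc / err
        new_a = a[:]
        new_a[m] = refl
        for k in range(m):
            new_a[k] = a[k] - refl * a[m - 1 - k]
        a = new_a
        err *= 1.0 - refl * refl
    return a


def estimate_all_pole(x, order):
    return levinson_durbin(autocorrelation(x, order), order)


def coefficient_error(true_a, est_a):
    """Stand-in error measure; NOT equation (45)."""
    return sum((t - e) ** 2 for t, e in zip(true_a, est_a))
```

Running `estimate_all_pole` on noise-free and on noisy data from the same
filter shows how additive noise biases the correlation method, which is the
baseline against which LMAP and RLMAP are compared in Figure 21.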

Fig. 21. Performance comparison of correlation method, LMAP and RLMAP
techniques in estimating all-pole parameters from noisy synthetic data.
After Lim (18). (a) Random noise excitation, (b) Pulse train excitation.


VIII.  CONCLUSIONS 


In  this  paper  we  have  attempted  to  survey  a variety  of  systems  for  speech 
enhancement  and  to  incorporate  them  within  a common  framework.  As  was  evident 
in  the  discussion  it  is  possible  to  generate  an  almost  unlimited  number  of 
systems  many  of  which  are  conceptually  plausible.  Furthermore  many  of  these 
systems  lead  to  an  improved  signal  to  noise  ratio  which  is  perceived  as  higher 
quality,  particularly  when  the  test  material  is  familiar  to  the  listener  so 
that  intelligibility  is  not  an  issue.  However,  almost  all  of  these  systems  in 
fact  reduce  intelligibility  and  those  that  do  not  tend  to  degrade  the  quality. 
This  suggests  then  that  there  remains  considerable  further  work  to  be  done  and 
room  for  improvement. 

As  an  additional  important  consideration  the  evaluation  of  an  enhancement 
system  is  very  much  dependent  on  the  context  in  which  it  is  to  be  used.  In 
some applications it is intelligibility that is of overriding importance and in
others  it  is  quality.  Additionally  a system  may  perhaps  slightly  reduce 
intelligibility  but  also  reduce  listener  fatigue  so  that  with  an  extended 
listening  task  intelligibility  is  eventually  increased.  To  our  knowledge  none 
of  the  systems  discussed  have  been  evaluated  in  terms  of  their  potential  to 
reduce  listener  fatigue. 

Essentially all of the systems considered here have their basis in a
mathematically optimal procedure such as minimization of mean square error or
maximization of a probability function, followed by a number of empirical
variations. It is generally known that these criteria are not particularly
well matched to auditory perception and it remains to develop a mathematical
error criterion that strongly correlates with human perception.

An  area  in  which  speech  enhancement  systems  have  been  successful  is  in  the 
context  of  bandwidth  compression.  Since  speech  bandwidth  compression  systems 
tend  to  degrade  quickly  in  the  presence  of  background  noise,  preprocessing 
with  a speech  enhancement  system  prior  to  bandwidth  compression  leads  to 
higher  intelligibility  after  compression  than  would  be  obtained  without  the 
preprocessor.  In  addition  as  was  discussed  in  Section  VI  some  systems  are 
specifically  formulated  as  analysis-synthesis  or  bandwidth  compression  systems 
with  noisy  inputs.  Of  particular  difficulty  in  narrowband  speech  compression 
systems  is  the  determination  of  excitation  parameters  including  pitch  and  a 
voiced,  unvoiced  or  silence  decision. 

We hope that the framework developed in this paper will provide the basis
for further research into speech enhancement techniques and will avoid the
rediscovery of existing techniques. In our opinion, the problem remains an
important  and  vital  one  with  a need  for  fresh  approaches  and  insights  which  we 
hope  will  emerge  over  the  next  several  years. 